Introduction
This file contains essential commands from the chapters of r4ds and corresponding examples. A command is considered “essential” when you really need to know it and need to know how to use it to succeed in this course.
All ds4psy essentials:
| Nr. | Topic |
|---|---|
| 1. | Creating and using tibbles |
| 2. | Data transformation |
| 3. | Visualizing data |
| 4. | Exploring data |
| 5. | Tidy data |
Course coordinates
- Course Data Science for Psychologists (ds4psy).
- Taught at the University of Konstanz by Hansjörg Neth (h.neth@uni.kn, SPDS, office D507).
- Spring/summer 2018: Mondays, 13:30–15:00, C511.
- Links to ZeUS and Ilias
Preparations
Create an R script (.R) or an R-Markdown file (.Rmd) and load the R packages of the tidyverse. (Hint: Structure your script by inserting spaces, meaningful comments, and sections.)
## Essential commands | Data science for psychologists
## 2018 06 24
## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ##
## Preparations: -----
library(tidyverse)
## Topic: -----
# ...
## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ##
## End of file. -----
Tibbles
Data can be found in the form of individual data points (so-called scalars, which can be of different types) or longer sequences of values (lists or vectors). However, most of the time we are dealing with datasets that contain multiple rows and columns (2-dimensional matrices or data frames, or multi-dimensional arrays).
Whenever working with rectangular data structures – data consisting of multiple cases (rows) and multiple variables (columns) – our first step in this course is to create or transform the data into a tibble. A tibble is defined by the package tibble and implements a particular type of data table (or a simpler version of a data frame, which is the most common data structure in R).
Creating tibbles
How we create tibbles depends on the form in which we encounter or obtain our data.
Basic commands
There are 3 basic commands for creating tibbles:
- as_tibble converts (or coerces) an existing data frame into a tibble.
- tibble converts several vectors into (the columns of) a tibble.
- tribble converts a table (entered row-by-row) into a tibble.
Check: The 3 commands yield the same type of output (i.e., a tibble), but require different inputs. Ask yourself which kind of input each command takes and how this input needs to be structured and formatted (e.g., with commas).
1. as_tibble
Use as_tibble when the data to be used already is in a data frame (or matrix):
## Using the data frame `sleep`: ------
# ?datasets::sleep # provides background information on the data set.
# Save the sleep data frame as df:
df <- datasets::sleep
# Convert df into a tibble tb:
tb <- as_tibble(df)
# Inspect the data frame df:
dim(df)
#> [1] 20 3
is.data.frame(df)
#> [1] TRUE
head(df)
#> extra group ID
#> 1 0.7 1 1
#> 2 -1.6 1 2
#> 3 -0.2 1 3
#> 4 -1.2 1 4
#> 5 -0.1 1 5
#> 6 3.4 1 6
str(df)
#> 'data.frame': 20 obs. of 3 variables:
#> $ extra: num 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
#> $ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
#> $ ID : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
# Inspect the tibble tb:
dim(tb)
#> [1] 20 3
is_tibble(tb) # Note: is.tibble() is deprecated in newer versions of tibble
#> [1] TRUE
is.data.frame(tb) # => tibbles ARE data frames.
#> [1] TRUE
head(tb)
#> # A tibble: 6 x 3
#> extra group ID
#> <dbl> <fctr> <fctr>
#> 1 0.7 1 1
#> 2 -1.6 1 2
#> 3 -0.2 1 3
#> 4 -1.2 1 4
#> 5 -0.1 1 5
#> 6 3.4 1 6
glimpse(tb)
#> Observations: 20
#> Variables: 3
#> $ extra <dbl> 0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0, 1....
#> $ group <fctr> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
#> $ ID <fctr> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, ...
Practice: Convert the data frames datasets::attitude and datasets::iris into tibbles and inspect their dimensions and contents. What types of variables do they contain?
2. tibble
Use tibble when the data to be used appears as a collection of columns. For instance, imagine we have the following information about a family:
| id | name | age | gender | drives | married_2 |
|---|---|---|---|---|---|
| 1 | Adam | 46 | male | TRUE | Eva |
| 2 | Eva | 48 | female | TRUE | Adam |
| 3 | Xaxi | 21 | female | FALSE | Zenon |
| 4 | Yota | 19 | female | TRUE | NA |
| 5 | Zack | 17 | male | FALSE | NA |
One way of viewing this table is as a series of columns. Each column consists of a variable name and the same number of (here: 5) values, which can be of different types (here: numbers, characters, or Boolean truth values). Each column may or may not contain missing values (entered as NA).
The tibble command expects that each column of the table is entered as a vector:
## Create a tibble from vectors (column-by-column):
fm <- tibble(
id = c(1, 2, 3, 4, 5), # OR: id = 1:5,
name = c("Adam", "Eva", "Xaxi", "Yota", "Zack"),
age = c(46, 48, 21, 19, 17),
gender = c("male", rep("female", 3), "male"),
drives = c(TRUE, TRUE, FALSE, TRUE, FALSE),
married_2 = c("Eva", "Adam", "Zenon", NA, NA)
)
fm # prints the tibble:
#> # A tibble: 5 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 3 Xaxi 21 female FALSE Zenon
#> 4 4 Yota 19 female TRUE <NA>
#> 5 5 Zack 17 male FALSE <NA>
Note some details:
- Each vector is labeled by the variable (column) name, which is not put into quotes;
- Avoid spaces within variable (column) names (or enclose names in single quotes if you really must use spaces);
- All vectors need to have the same length (only vectors of length 1 are recycled);
- Each vector is of a single type (numeric, character, or Boolean truth values);
- Consecutive vectors are separated by commas (but there is no comma after the final vector).
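Two of these details can be tried out directly. The following is only a minimal sketch with made-up data (the names `tb_demo`, `first name`, and `group` are illustrative assumptions, not part of the course materials). It encloses a name containing a space in backticks, which is a common way of writing such names in R, and it passes one vector of length 1, which tibble recycles to the length of the other column.

```r
library(tibble)

# Hypothetical example data (not from the course materials):
tb_demo <- tibble(
  `first name` = c("Ann", "Bob", "Cy"),  # name with a space, enclosed in backticks
  group = "A"                            # length-1 vector is recycled to 3 rows
)

dim(tb_demo)           # number of rows and columns
names(tb_demo)         # the column names, including the one with a space
tb_demo$`first name`   # backticks are needed again when accessing the column
```

Names with spaces work, but they have to be quoted every time they are used, which is why avoiding them is usually the simpler choice.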
When using tibble, later vectors may use the values of earlier vectors:
# Using earlier vectors when defining later ones:
abc <- tibble(
ltr = LETTERS[1:5],
num = 1:5,
l_n = paste(ltr, num, sep = "_"), # combining ltr with num
nsq = num^2 # squaring num
)
abc # prints the tibble:
#> # A tibble: 5 x 4
#> ltr num l_n nsq
#> <chr> <int> <chr> <dbl>
#> 1 A 1 A_1 1
#> 2 B 2 B_2 4
#> 3 C 3 C_3 9
#> 4 D 4 D_4 16
#> 5 E 5 E_5 25
Practice: Find some tabular data online (e.g., on Wikipedia) and enter it as a tibble.
3. tribble
Use tribble when the data to be used appears as a collection of rows (or already is in tabular form).
For instance, when you copy and paste the above family data from an electronic document, it is easy to insert commas between consecutive cell values and use tribble to convert it into a tibble:
## Create a tibble from tabular data (row-by-row):
fm2 <- tribble(
~id, ~name, ~age, ~gender, ~drives, ~married_2,
#--|------|-----|--------|----------|----------|
1, "Adam", 46, "male", TRUE, "Eva",
2, "Eva", 48, "female", TRUE, "Adam",
3, "Xaxi", 21, "female", FALSE, "Zenon",
4, "Yota", 19, "female", TRUE, NA,
5, "Zack", 17, "male", FALSE, NA )
fm2 # prints the tibble:
#> # A tibble: 5 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 3 Xaxi 21 female FALSE Zenon
#> 4 4 Yota 19 female TRUE <NA>
#> 5 5 Zack 17 male FALSE <NA>
Note some details:
- The column names are preceded by ~;
- Consecutive entries are separated by a comma (but there is no comma after the final entry);
- The line #--|------|-----|--------|----------|----------| is commented out and can be omitted;
- The type of each column is determined by the type of the corresponding cell values. For instance, the NA values in fm2 are missing character values because the entries above were characters (entered in quotes).
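The point about column types can be checked with a small sketch. The data and the name `tr_demo` are made up for illustration: an NA takes on the type of the other values in its column, whereas a column that contains nothing but NA is stored as a logical vector.

```r
library(tibble)

# Hypothetical example data (not from the course materials):
tr_demo <- tribble(
  ~name,  ~partner, ~score, ~unknown,
  "Yota", NA,       1,      NA,
  "Zack", "Ann",    NA,     NA
)

class(tr_demo$partner)  # character: NA adopts the type of "Ann"
class(tr_demo$score)    # numeric: NA adopts the type of 1
class(tr_demo$unknown)  # logical: a column of only NA values
```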
Check: If tibble and tribble really are alternative commands, then the contents of our objects fm and fm2 should be identical:
# Are fm and fm2 equal?
all.equal(fm, fm2)
#> [1] TRUE
Practice: Enter the tibble abc by using tribble.
Accessing parts of a tibble
Once we have an R object that is a tibble, we often want to access individual parts of it. We can distinguish between 3 simple cases:
1. Variables (columns)
As each column of a tibble is a vector, obtaining a column amounts to obtaining the corresponding vector. We can access this vector by its name (label) or by its number (column position):
fm # family tibble (defined above):
#> # A tibble: 5 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 3 Xaxi 21 female FALSE Zenon
#> 4 4 Yota 19 female TRUE <NA>
#> 5 5 Zack 17 male FALSE <NA>
# Get the name column of fm:
fm$name # by label (with $)
#> [1] "Adam" "Eva" "Xaxi" "Yota" "Zack"
fm[["name"]] # by label (with [[ ]])
#> [1] "Adam" "Eva" "Xaxi" "Yota" "Zack"
fm[[2]] # by number (with [[ ]])
#> [1] "Adam" "Eva" "Xaxi" "Yota" "Zack"
# Get the age column of fm:
fm$age # by name (with $)
#> [1] 46 48 21 19 17
fm[["age"]] # by name (with [[ ]])
#> [1] 46 48 21 19 17
fm[[3]] # by number (with [[ ]])
#> [1] 46 48 21 19 17
# Note: The following commands all yield the same columns, but as (5 x 1) tibbles rather than as vectors:
fm[ , 2] # yields the name vector as a (5 x 1) tibble
#> # A tibble: 5 x 1
#> name
#> <chr>
#> 1 Adam
#> 2 Eva
#> 3 Xaxi
#> 4 Yota
#> 5 Zack
select(fm, 2)
#> # A tibble: 5 x 1
#> name
#> <chr>
#> 1 Adam
#> 2 Eva
#> 3 Xaxi
#> 4 Yota
#> 5 Zack
select(fm, name)
#> # A tibble: 5 x 1
#> name
#> <chr>
#> 1 Adam
#> 2 Eva
#> 3 Xaxi
#> 4 Yota
#> 5 Zack
fm[ , 3] # yields the age vector as a (5 x 1) tibble
#> # A tibble: 5 x 1
#> age
#> <dbl>
#> 1 46
#> 2 48
#> 3 21
#> 4 19
#> 5 17
select(fm, 3)
#> # A tibble: 5 x 1
#> age
#> <dbl>
#> 1 46
#> 2 48
#> 3 21
#> 4 19
#> 5 17
select(fm, age)
#> # A tibble: 5 x 1
#> age
#> <dbl>
#> 1 46
#> 2 48
#> 3 21
#> 4 19
#> 5 17
Practice: Extract the price column of ggplot2::diamonds in at least 3 different ways and verify that they all yield the same mean price.
2. Cases (rows)
Extracting specific rows of a tibble amounts to filtering a tibble and typically yields smaller tibbles (as a row may contain entries of different types). The best way of filtering specific rows of a tibble is using dplyr::filter. However, it’s also possible to specify the desired rows by subsetting (i.e., specifying a condition that results in a Boolean value) and by row number:
fm # family tibble (defined above):
#> # A tibble: 5 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 3 Xaxi 21 female FALSE Zenon
#> 4 4 Yota 19 female TRUE <NA>
#> 5 5 Zack 17 male FALSE <NA>
# Filter specific rows (by condition):
filter(fm, id > 2)
#> # A tibble: 3 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 3 Xaxi 21 female FALSE Zenon
#> 2 4 Yota 19 female TRUE <NA>
#> 3 5 Zack 17 male FALSE <NA>
filter(fm, age < 18)
#> # A tibble: 1 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 5 Zack 17 male FALSE <NA>
fm %>% filter(drives == TRUE)
#> # A tibble: 3 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 4 Yota 19 female TRUE <NA>
# The same filters by using Boolean vectors (subsetting):
fm[fm$id > 2, ]
#> # A tibble: 3 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 3 Xaxi 21 female FALSE Zenon
#> 2 4 Yota 19 female TRUE <NA>
#> 3 5 Zack 17 male FALSE <NA>
fm[fm$age < 18, ]
#> # A tibble: 1 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 5 Zack 17 male FALSE <NA>
fm[fm$drives == TRUE, ]
#> # A tibble: 3 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 4 Yota 19 female TRUE <NA>
# The same filters by providing specific row numbers:
fm[3:5, ] # getting rows 3 to 5 of fm
#> # A tibble: 3 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 3 Xaxi 21 female FALSE Zenon
#> 2 4 Yota 19 female TRUE <NA>
#> 3 5 Zack 17 male FALSE <NA>
fm[5, ] # getting row 5 of fm
#> # A tibble: 1 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 5 Zack 17 male FALSE <NA>
fm[c(1, 2, 4), ] # getting rows 1, 2, and 4 of fm
#> # A tibble: 3 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 4 Yota 19 female TRUE <NA>
Practice: Extract all diamonds from ggplot2::diamonds that have at least 2 carat. How many of them are there and what is their mean price?
3. Cells
Accessing the values of individual tibble cells is relatively rare, but can be achieved by
a. explicitly providing both row number `r` and column number `c` (as `[r, c]`), or by
b. first extracting the column (as a vector `v`) and then providing the desired row number `r` (`v[r]`).
fm # family tibble (defined above):
#> # A tibble: 5 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 3 Xaxi 21 female FALSE Zenon
#> 4 4 Yota 19 female TRUE <NA>
#> 5 5 Zack 17 male FALSE <NA>
# Getting specific cell values:
fm$name[4] # getting the name of the 4th row
#> [1] "Yota"
fm[4, 2] # getting the same name by row and column numbers
#> # A tibble: 1 x 1
#> name
#> <chr>
#> 1 Yota
# Note: What if we don't know the row number?
which(fm$name == "Yota") # getting the row number that contains the name "Yota"
#> [1] 4
In practice, accessing individual cell values is mostly needed to check for specific cell values and to change or correct erroneous entries by re-assigning them to a different value.
# Checking and changing cell values:
# Check: "Who is Xaxi's spouse?" in 3 different ways:
fm[fm$name == "Xaxi", ]$married_2
#> [1] "Zenon"
fm$married_2[3]
#> [1] "Zenon"
fm[3, 6]
#> # A tibble: 1 x 1
#> married_2
#> <chr>
#> 1 Zenon
# Change: "Zenon" is actually "Zeus" in 3 different ways:
fm[fm$name == "Xaxi", ]$married_2 <- "Zeus"
fm$married_2[3] <- "Zeus"
fm[3, 6] <- "Zeus"
# Check for successful change:
fm
#> # A tibble: 5 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 3 Xaxi 21 female FALSE Zeus
#> 4 4 Yota 19 female TRUE <NA>
#> 5 5 Zack 17 male FALSE <NA>
By contrast, a relatively common task is to check an entire tibble for missing values, count them, or replace them by some other value:
# Checking for, counting, and changing missing values:
fm # family tibble (defined above):
#> # A tibble: 5 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 3 Xaxi 21 female FALSE Zeus
#> 4 4 Yota 19 female TRUE <NA>
#> 5 5 Zack 17 male FALSE <NA>
# (a) Check for missing values:
is.na(fm) # checks each cell value for being NA
#> id name age gender drives married_2
#> [1,] FALSE FALSE FALSE FALSE FALSE FALSE
#> [2,] FALSE FALSE FALSE FALSE FALSE FALSE
#> [3,] FALSE FALSE FALSE FALSE FALSE FALSE
#> [4,] FALSE FALSE FALSE FALSE FALSE TRUE
#> [5,] FALSE FALSE FALSE FALSE FALSE TRUE
# (b) Count the number of missing values:
sum(is.na(fm)) # counts missing values (by adding up all TRUE values)
#> [1] 2
# (c) Change all missing values:
fm[is.na(fm)] <- "A MISSING value!"
# Check for successful change:
fm
#> # A tibble: 5 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 3 Xaxi 21 female FALSE Zeus
#> 4 4 Yota 19 female TRUE A MISSING value!
#> 5 5 Zack 17 male FALSE A MISSING value!
Practice: Determine the number and the percentage of missing values in the datasets dplyr::starwars and dplyr::storms.
More advanced operations on tibbles are covered in Chapter 5: Data transformation and involve using the dplyr commands arrange, filter, and select.
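The checks for missing values shown above can also be broken down by column or expressed as a proportion. The following is a minimal sketch on a reduced copy of the family data (the object names `fm_na`, `na_per_col`, and `na_share` are assumptions made for this example). It relies only on base R functions applied to the logical matrix returned by is.na.

```r
library(tibble)

# A reduced copy of the family data (hypothetical object name), so that the
# example does not depend on earlier changes to fm:
fm_na <- tibble(
  name      = c("Adam", "Eva", "Xaxi", "Yota", "Zack"),
  age       = c(46, 48, 21, 19, 17),
  married_2 = c("Eva", "Adam", "Zenon", NA, NA)
)

na_per_col <- colSums(is.na(fm_na))  # number of NA values in each column
na_share   <- mean(is.na(fm_na))     # proportion of NA values among all cells

na_per_col
na_share * 100                       # the same proportion as a percentage

# Replace NA values in one column only (other columns remain untouched):
fm_na$married_2[is.na(fm_na$married_2)] <- "unknown"
sum(is.na(fm_na))                    # counts the remaining NA values
```

Restricting the replacement to a single column avoids accidentally writing a character value into columns of a different type.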
More on tibbles
For more details on tibbles,
- study vignette("tibble") and the documentation for ?tibble;
- study https://tibble.tidyverse.org/ and its examples;
- read Chapter 10: Tibbles and complete its exercises.
Data transformation
Overview
When we have data in the form of a tibble or data frame, dplyr provides a range of simple tools to transform this data. Six essential dplyr commands are:
- arrange sorts cases (rows);
- filter selects cases (rows) by logical conditions;
- select selects and reorders variables (columns);
- mutate computes new variables (columns) and adds them to existing ones;
- summarise collapses multiple values of a variable (rows of a column) to a single one;
- group_by changes the unit of aggregation (in combination with mutate and summarise).
Not quite as essential but still useful dplyr commands include:
- slice selects (ranges of) cases (rows) by number;
- rename renames variables (columns) and keeps others;
- transmute computes new variables (columns) and drops existing ones;
- sample_n and sample_frac draw random samples of cases (rows).
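As a brief illustration of three of these additional commands, here is a minimal sketch on made-up data (the tibble `toy` and the names `rows_2_3`, `renamed`, and `in_months` are assumptions for this example, not part of the course materials). The sampling commands are left out because their results vary from run to run.

```r
library(dplyr)
library(tibble)

# Hypothetical example data (not from the course materials):
toy <- tibble(
  name = c("Adam", "Eva", "Xaxi"),
  age  = c(46, 48, 21)
)

rows_2_3  <- slice(toy, 2:3)                       # rows 2 to 3 (by number)
renamed   <- rename(toy, first_name = name)        # syntax: new_name = old_name
in_months <- transmute(toy, age_months = age * 12) # keeps only the new variable

rows_2_3
renamed
in_months
```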
Commands and examples
We save the dplyr::starwars data as a tibble sw and use it to illustrate the essential dplyr commands.
library(tidyverse)
sw <- dplyr::starwars
sw # => A tibble: 87 rows (individuals) x 13 columns (variables)
#> # A tibble: 87 x 13
#> name height mass hair_color skin_color eye_color
#> <chr> <int> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 blond fair blue
#> 2 C-3PO 167 75 <NA> gold yellow
#> 3 R2-D2 96 32 <NA> white, blue red
#> 4 Darth Vader 202 136 none white yellow
#> 5 Leia Organa 150 49 brown light brown
#> 6 Owen Lars 178 120 brown, grey light blue
#> 7 Beru Whitesun lars 165 75 brown light blue
#> 8 R5-D4 97 32 <NA> white, red red
#> 9 Biggs Darklighter 183 84 black light brown
#> 10 Obi-Wan Kenobi 182 77 auburn, white fair blue-gray
#> # ... with 77 more rows, and 7 more variables: birth_year <dbl>,
#> # gender <chr>, homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
Practice: How many sw variables (columns) are there and of which type are they? How many missing (NA) values are there?
1. arrange to sort rows
Using arrange sorts cases (rows) by putting specific variables (columns) in specific orders (e.g., ascending or descending):
# Sort rows alphabetically (by name):
arrange(sw, name)
#> # A tibble: 87 x 13
#> name height mass hair_color skin_color
#> <chr> <int> <dbl> <chr> <chr>
#> 1 Ackbar 180 83 none brown mottle
#> 2 Adi Gallia 184 50 none dark
#> 3 Anakin Skywalker 188 84 blond fair
#> 4 Arvel Crynyd NA NA brown fair
#> 5 Ayla Secura 178 55 none blue
#> 6 Bail Prestor Organa 191 NA black tan
#> 7 Barriss Offee 166 50 black yellow
#> 8 BB8 NA NA none none
#> 9 Ben Quadinaros 163 65 none grey, green, yellow
#> 10 Beru Whitesun lars 165 75 brown light
#> # ... with 77 more rows, and 8 more variables: eye_color <chr>,
#> # birth_year <dbl>, gender <chr>, homeworld <chr>, species <chr>,
#> # films <list>, vehicles <list>, starships <list>
# The same command using the pipe:
sw %>% # Note: %>% is NOT + (used in ggplot)
arrange(name)
#> # A tibble: 87 x 13
#> name height mass hair_color skin_color
#> <chr> <int> <dbl> <chr> <chr>
#> 1 Ackbar 180 83 none brown mottle
#> 2 Adi Gallia 184 50 none dark
#> 3 Anakin Skywalker 188 84 blond fair
#> 4 Arvel Crynyd NA NA brown fair
#> 5 Ayla Secura 178 55 none blue
#> 6 Bail Prestor Organa 191 NA black tan
#> 7 Barriss Offee 166 50 black yellow
#> 8 BB8 NA NA none none
#> 9 Ben Quadinaros 163 65 none grey, green, yellow
#> 10 Beru Whitesun lars 165 75 brown light
#> # ... with 77 more rows, and 8 more variables: eye_color <chr>,
#> # birth_year <dbl>, gender <chr>, homeworld <chr>, species <chr>,
#> # films <list>, vehicles <list>, starships <list>
# Sort rows in descending order:
sw %>%
arrange(desc(name))
#> # A tibble: 87 x 13
#> name height mass hair_color skin_color
#> <chr> <int> <dbl> <chr> <chr>
#> 1 Zam Wesell 168 55 blonde fair, green, yellow
#> 2 Yoda 66 17 white green
#> 3 Yarael Poof 264 NA none white
#> 4 Wilhuff Tarkin 180 NA auburn, grey fair
#> 5 Wicket Systri Warrick 88 20 brown brown
#> 6 Wedge Antilles 170 77 brown fair
#> 7 Watto 137 NA black blue, grey
#> 8 Wat Tambor 193 48 none green, grey
#> 9 Tion Medon 206 80 none grey
#> 10 Taun We 213 NA none grey
#> # ... with 77 more rows, and 8 more variables: eye_color <chr>,
#> # birth_year <dbl>, gender <chr>, homeworld <chr>, species <chr>,
#> # films <list>, vehicles <list>, starships <list>
# Sort by multiple variables:
sw %>%
arrange(eye_color, gender, desc(height))
#> # A tibble: 87 x 13
#> name height mass hair_color skin_color eye_color
#> <chr> <int> <dbl> <chr> <chr> <chr>
#> 1 Taun We 213 NA none grey black
#> 2 Shaak Ti 178 57 none red, blue, white black
#> 3 Lama Su 229 88 none grey black
#> 4 Tion Medon 206 80 none grey black
#> 5 Kit Fisto 196 87 none green black
#> 6 Plo Koon 188 80 none orange black
#> 7 Greedo 173 74 <NA> green black
#> 8 Nien Nunb 160 68 none grey black
#> 9 Gasgano 122 NA none white, blue black
#> 10 BB8 NA NA none none black
#> # ... with 77 more rows, and 7 more variables: birth_year <dbl>,
#> # gender <chr>, homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
## Note: See
# ?dplyr::arrange # for more help and examples.
Note some details:
- All basic dplyr commands can be called as verb(data, ...) or – using the pipe from magrittr – as data %>% verb(...) (see vignette("magrittr") for details).
- Variable names are unquoted.
- The order of variable names (x, y, ...) specifies the order or priority of operations (first by x, then by y, etc.).
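The effect of the order of variables is easiest to see on a very small table. The sketch below uses made-up data (the tibble `toy` and its columns `grp` and `val` are assumptions for this example) and sorts it twice, once in each order, using both calling styles.

```r
library(dplyr)
library(tibble)

# Hypothetical example data (not from the course materials):
toy <- tibble(
  grp = c("b", "a", "b", "a"),
  val = c(2, 4, 1, 3)
)

by_grp_val <- arrange(toy, grp, val)     # first by grp, then by val (within grp)
by_val_grp <- toy %>% arrange(val, grp)  # first by val, then by grp (pipe style)

by_grp_val$val  # values ordered within groups "a" and "b"
by_val_grp$val  # values in overall ascending order
```

In the second call, grp only serves to break ties in val, and as there are no ties here, it has no visible effect.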
Practice: Arrange the sw data in different ways, combining multiple variables and (ascending and descending) orders. Where are cases containing NA values in sorted variables placed?
2. filter to select rows
Using filter selects cases (rows) by logical conditions. It keeps all rows for which the conditions are TRUE and drops all rows for which the conditions are FALSE or NA.
# Filter to keep all humans:
filter(sw, species == "Human")
#> # A tibble: 35 x 13
#> name height mass hair_color skin_color eye_color
#> <chr> <int> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 blond fair blue
#> 2 Darth Vader 202 136 none white yellow
#> 3 Leia Organa 150 49 brown light brown
#> 4 Owen Lars 178 120 brown, grey light blue
#> 5 Beru Whitesun lars 165 75 brown light blue
#> 6 Biggs Darklighter 183 84 black light brown
#> 7 Obi-Wan Kenobi 182 77 auburn, white fair blue-gray
#> 8 Anakin Skywalker 188 84 blond fair blue
#> 9 Wilhuff Tarkin 180 NA auburn, grey fair blue
#> 10 Han Solo 180 80 brown fair brown
#> # ... with 25 more rows, and 7 more variables: birth_year <dbl>,
#> # gender <chr>, homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
# The same command using the pipe:
sw %>% # Note: %>% is NOT + (used in ggplot)
filter(species == "Human")
#> # A tibble: 35 x 13
#> name height mass hair_color skin_color eye_color
#> <chr> <int> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 blond fair blue
#> 2 Darth Vader 202 136 none white yellow
#> 3 Leia Organa 150 49 brown light brown
#> 4 Owen Lars 178 120 brown, grey light blue
#> 5 Beru Whitesun lars 165 75 brown light blue
#> 6 Biggs Darklighter 183 84 black light brown
#> 7 Obi-Wan Kenobi 182 77 auburn, white fair blue-gray
#> 8 Anakin Skywalker 188 84 blond fair blue
#> 9 Wilhuff Tarkin 180 NA auburn, grey fair blue
#> 10 Han Solo 180 80 brown fair brown
#> # ... with 25 more rows, and 7 more variables: birth_year <dbl>,
#> # gender <chr>, homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
# Filter by multiple (additive) conditions:
sw %>%
filter(height > 180, mass <= 75) # tall and light individuals
#> # A tibble: 3 x 13
#> name height mass hair_color skin_color eye_color birth_year
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl>
#> 1 Jar Jar Binks 196 66 none orange orange 52
#> 2 Adi Gallia 184 50 none dark blue NA
#> 3 Wat Tambor 193 48 none green, grey unknown NA
#> # ... with 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#> # films <list>, vehicles <list>, starships <list>
# The same command using the logical operator (&):
sw %>%
filter(height > 180 & mass <= 75) # tall and light individuals
#> # A tibble: 3 x 13
#> name height mass hair_color skin_color eye_color birth_year
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl>
#> 1 Jar Jar Binks 196 66 none orange orange 52
#> 2 Adi Gallia 184 50 none dark blue NA
#> 3 Wat Tambor 193 48 none green, grey unknown NA
#> # ... with 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#> # films <list>, vehicles <list>, starships <list>
# Filter for a range of a specific variable:
sw %>%
filter(height >= 150, height <= 165) # (a) using height twice
#> # A tibble: 9 x 13
#> name height mass hair_color skin_color eye_color
#> <chr> <int> <dbl> <chr> <chr> <chr>
#> 1 Leia Organa 150 49 brown light brown
#> 2 Beru Whitesun lars 165 75 brown light blue
#> 3 Mon Mothma 150 NA auburn fair blue
#> 4 Nien Nunb 160 68 none grey black
#> 5 Shmi Skywalker 163 NA black fair brown
#> 6 Ben Quadinaros 163 65 none grey, green, yellow orange
#> 7 Cordé 157 NA brown light brown
#> 8 Dormé 165 NA brown light brown
#> 9 Padmé Amidala 165 45 brown light brown
#> # ... with 7 more variables: birth_year <dbl>, gender <chr>,
#> # homeworld <chr>, species <chr>, films <list>, vehicles <list>,
#> # starships <list>
sw %>%
filter(between(height, 150, 165)) # (b) using between(...)
#> # A tibble: 9 x 13
#> name height mass hair_color skin_color eye_color
#> <chr> <int> <dbl> <chr> <chr> <chr>
#> 1 Leia Organa 150 49 brown light brown
#> 2 Beru Whitesun lars 165 75 brown light blue
#> 3 Mon Mothma 150 NA auburn fair blue
#> 4 Nien Nunb 160 68 none grey black
#> 5 Shmi Skywalker 163 NA black fair brown
#> 6 Ben Quadinaros 163 65 none grey, green, yellow orange
#> 7 Cordé 157 NA brown light brown
#> 8 Dormé 165 NA brown light brown
#> 9 Padmé Amidala 165 45 brown light brown
#> # ... with 7 more variables: birth_year <dbl>, gender <chr>,
#> # homeworld <chr>, species <chr>, films <list>, vehicles <list>,
#> # starships <list>
# Filter by multiple (alternative) conditions:
sw %>%
filter(homeworld == "Kashyyyk" | skin_color == "green")
#> # A tibble: 8 x 13
#> name height mass hair_color skin_color eye_color
#> <chr> <int> <dbl> <chr> <chr> <chr>
#> 1 Chewbacca 228 112 brown unknown blue
#> 2 Greedo 173 74 <NA> green black
#> 3 Yoda 66 17 white green brown
#> 4 Bossk 190 113 none green red
#> 5 Rugor Nass 206 NA none green orange
#> 6 Kit Fisto 196 87 none green black
#> 7 Poggle the Lesser 183 80 none green yellow
#> 8 Tarfful 234 136 brown brown blue
#> # ... with 7 more variables: birth_year <dbl>, gender <chr>,
#> # homeworld <chr>, species <chr>, films <list>, vehicles <list>,
#> # starships <list>
# Filter cases with missing (NA) values on specific variables:
sw %>%
filter(is.na(gender))
#> # A tibble: 3 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 C-3PO 167 75 <NA> gold yellow 112 <NA>
#> 2 R2-D2 96 32 <NA> white, blue red 33 <NA>
#> 3 R5-D4 97 32 <NA> white, red red NA <NA>
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
# Filter cases with existing (non-NA) values on specific variables:
sw %>%
filter(!is.na(mass), !is.na(birth_year))
#> # A tibble: 36 x 13
#> name height mass hair_color skin_color eye_color
#> <chr> <int> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 blond fair blue
#> 2 C-3PO 167 75 <NA> gold yellow
#> 3 R2-D2 96 32 <NA> white, blue red
#> 4 Darth Vader 202 136 none white yellow
#> 5 Leia Organa 150 49 brown light brown
#> 6 Owen Lars 178 120 brown, grey light blue
#> 7 Beru Whitesun lars 165 75 brown light blue
#> 8 Biggs Darklighter 183 84 black light brown
#> 9 Obi-Wan Kenobi 182 77 auburn, white fair blue-gray
#> 10 Anakin Skywalker 188 84 blond fair blue
#> # ... with 26 more rows, and 7 more variables: birth_year <dbl>,
#> # gender <chr>, homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
## Note: See
# ?dplyr::filter # for more help and examples.
Note some details:
- Variable names are unquoted.
- The comma between conditions or tests (x, y, ...) means the same as & (logical AND), as each test results in a vector of Boolean values.
- Unlike in base R, rows for which the condition evaluates to NA are dropped.
- Additional filter functions include near() for testing numerical (near-)identity.
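The difference in handling NA values and the purpose of near() can be illustrated with a minimal sketch. The data frame `df_na` and the result names are made up for this example, and a plain data frame is used so that the base R behavior is shown unchanged.

```r
library(dplyr)

# Hypothetical example data (not from the course materials):
df_na <- data.frame(id = 1:4, mass = c(50, NA, 80, 120))

kept_dplyr <- filter(df_na, mass > 60)  # the row with an NA condition is dropped
kept_base  <- df_na[df_na$mass > 60, ]  # base R returns an extra row of NA values

nrow(kept_dplyr)
nrow(kept_base)

# Comparing computed numbers with == can fail due to floating-point precision:
sqrt(2)^2 == 2       # FALSE
near(sqrt(2)^2, 2)   # TRUE
```

When using base R subsetting, wrapping the condition in which() is one way of avoiding the additional NA row.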
Practice: Use filter on sw to select very diverse or narrow subsets of individuals. For instance,
- which individual with blond hair and blue eyes has an unknown mass?
- of which species are individuals that are over 2m tall and have brown hair?
- which individuals from Tatooine are not male (but may be NA)?
- which individuals are neither male nor female OR heavier than 130kg?
3. select to select columns
Using select selects variables (columns) by their names or numbers:
# Select 4 specific variables (columns) of sw:
select(sw, name, species, birth_year, gender)
#> # A tibble: 87 x 4
#> name species birth_year gender
#> <chr> <chr> <dbl> <chr>
#> 1 Luke Skywalker Human 19.0 male
#> 2 C-3PO Droid 112.0 <NA>
#> 3 R2-D2 Droid 33.0 <NA>
#> 4 Darth Vader Human 41.9 male
#> 5 Leia Organa Human 19.0 female
#> 6 Owen Lars Human 52.0 male
#> 7 Beru Whitesun lars Human 47.0 female
#> 8 R5-D4 Droid NA <NA>
#> 9 Biggs Darklighter Human 24.0 male
#> 10 Obi-Wan Kenobi Human 57.0 male
#> # ... with 77 more rows
# The same when using the pipe:
sw %>% # Note: %>% is NOT + (used in ggplot)
select(name, species, birth_year, gender)
#> # A tibble: 87 x 4
#> name species birth_year gender
#> <chr> <chr> <dbl> <chr>
#> 1 Luke Skywalker Human 19.0 male
#> 2 C-3PO Droid 112.0 <NA>
#> 3 R2-D2 Droid 33.0 <NA>
#> 4 Darth Vader Human 41.9 male
#> 5 Leia Organa Human 19.0 female
#> 6 Owen Lars Human 52.0 male
#> 7 Beru Whitesun lars Human 47.0 female
#> 8 R5-D4 Droid NA <NA>
#> 9 Biggs Darklighter Human 24.0 male
#> 10 Obi-Wan Kenobi Human 57.0 male
#> # ... with 77 more rows
# The same when providing a vector of variable names:
sw %>%
select(c(name, species, birth_year, gender))
#> # A tibble: 87 x 4
#> name species birth_year gender
#> <chr> <chr> <dbl> <chr>
#> 1 Luke Skywalker Human 19.0 male
#> 2 C-3PO Droid 112.0 <NA>
#> 3 R2-D2 Droid 33.0 <NA>
#> 4 Darth Vader Human 41.9 male
#> 5 Leia Organa Human 19.0 female
#> 6 Owen Lars Human 52.0 male
#> 7 Beru Whitesun lars Human 47.0 female
#> 8 R5-D4 Droid NA <NA>
#> 9 Biggs Darklighter Human 24.0 male
#> 10 Obi-Wan Kenobi Human 57.0 male
#> # ... with 77 more rows
# The same when providing column numbers:
sw %>%
select(1, 10, 7, 8)
#> # A tibble: 87 x 4
#> name species birth_year gender
#> <chr> <chr> <dbl> <chr>
#> 1 Luke Skywalker Human 19.0 male
#> 2 C-3PO Droid 112.0 <NA>
#> 3 R2-D2 Droid 33.0 <NA>
#> 4 Darth Vader Human 41.9 male
#> 5 Leia Organa Human 19.0 female
#> 6 Owen Lars Human 52.0 male
#> 7 Beru Whitesun lars Human 47.0 female
#> 8 R5-D4 Droid NA <NA>
#> 9 Biggs Darklighter Human 24.0 male
#> 10 Obi-Wan Kenobi Human 57.0 male
#> # ... with 77 more rows
# The same when providing a vector of column numbers:
sw %>%
select(c(1, 10, 7, 8))
#> # A tibble: 87 x 4
#> name species birth_year gender
#> <chr> <chr> <dbl> <chr>
#> 1 Luke Skywalker Human 19.0 male
#> 2 C-3PO Droid 112.0 <NA>
#> 3 R2-D2 Droid 33.0 <NA>
#> 4 Darth Vader Human 41.9 male
#> 5 Leia Organa Human 19.0 female
#> 6 Owen Lars Human 52.0 male
#> 7 Beru Whitesun lars Human 47.0 female
#> 8 R5-D4 Droid NA <NA>
#> 9 Biggs Darklighter Human 24.0 male
#> 10 Obi-Wan Kenobi Human 57.0 male
#> # ... with 77 more rows
# Select ranges of variables with ":":
sw %>%
select(name:mass, films:starships)
#> # A tibble: 87 x 6
#> name height mass films vehicles starships
#> <chr> <int> <dbl> <list> <list> <list>
#> 1 Luke Skywalker 172 77 <chr [5]> <chr [2]> <chr [2]>
#> 2 C-3PO 167 75 <chr [6]> <chr [0]> <chr [0]>
#> 3 R2-D2 96 32 <chr [7]> <chr [0]> <chr [0]>
#> 4 Darth Vader 202 136 <chr [4]> <chr [0]> <chr [1]>
#> 5 Leia Organa 150 49 <chr [5]> <chr [1]> <chr [0]>
#> 6 Owen Lars 178 120 <chr [3]> <chr [0]> <chr [0]>
#> 7 Beru Whitesun lars 165 75 <chr [3]> <chr [0]> <chr [0]>
#> 8 R5-D4 97 32 <chr [1]> <chr [0]> <chr [0]>
#> 9 Biggs Darklighter 183 84 <chr [1]> <chr [0]> <chr [1]>
#> 10 Obi-Wan Kenobi 182 77 <chr [6]> <chr [1]> <chr [5]>
#> # ... with 77 more rows
# Select to re-order variables (columns) with everything():
sw %>%
select(species, name, gender, everything())
#> # A tibble: 87 x 13
#> species name gender height mass hair_color
#> <chr> <chr> <chr> <int> <dbl> <chr>
#> 1 Human Luke Skywalker male 172 77 blond
#> 2 Droid C-3PO <NA> 167 75 <NA>
#> 3 Droid R2-D2 <NA> 96 32 <NA>
#> 4 Human Darth Vader male 202 136 none
#> 5 Human Leia Organa female 150 49 brown
#> 6 Human Owen Lars male 178 120 brown, grey
#> 7 Human Beru Whitesun lars female 165 75 brown
#> 8 Droid R5-D4 <NA> 97 32 <NA>
#> 9 Human Biggs Darklighter male 183 84 black
#> 10 Human Obi-Wan Kenobi male 182 77 auburn, white
#> # ... with 77 more rows, and 7 more variables: skin_color <chr>,
#> # eye_color <chr>, birth_year <dbl>, homeworld <chr>, films <list>,
#> # vehicles <list>, starships <list>
# Select variables with helper functions:
sw %>%
select(starts_with("s"))
#> # A tibble: 87 x 3
#> skin_color species starships
#> <chr> <chr> <list>
#> 1 fair Human <chr [2]>
#> 2 gold Droid <chr [0]>
#> 3 white, blue Droid <chr [0]>
#> 4 white Human <chr [1]>
#> 5 light Human <chr [0]>
#> 6 light Human <chr [0]>
#> 7 light Human <chr [0]>
#> 8 white, red Droid <chr [0]>
#> 9 light Human <chr [1]>
#> 10 fair Human <chr [5]>
#> # ... with 77 more rows
sw %>%
select(ends_with("s"))
#> # A tibble: 87 x 5
#> mass species films vehicles starships
#> <dbl> <chr> <list> <list> <list>
#> 1 77 Human <chr [5]> <chr [2]> <chr [2]>
#> 2 75 Droid <chr [6]> <chr [0]> <chr [0]>
#> 3 32 Droid <chr [7]> <chr [0]> <chr [0]>
#> 4 136 Human <chr [4]> <chr [0]> <chr [1]>
#> 5 49 Human <chr [5]> <chr [1]> <chr [0]>
#> 6 120 Human <chr [3]> <chr [0]> <chr [0]>
#> 7 75 Human <chr [3]> <chr [0]> <chr [0]>
#> 8 32 Droid <chr [1]> <chr [0]> <chr [0]>
#> 9 84 Human <chr [1]> <chr [0]> <chr [1]>
#> 10 77 Human <chr [6]> <chr [1]> <chr [5]>
#> # ... with 77 more rows
sw %>%
select(contains("_"))
#> # A tibble: 87 x 4
#> hair_color skin_color eye_color birth_year
#> <chr> <chr> <chr> <dbl>
#> 1 blond fair blue 19.0
#> 2 <NA> gold yellow 112.0
#> 3 <NA> white, blue red 33.0
#> 4 none white yellow 41.9
#> 5 brown light brown 19.0
#> 6 brown, grey light blue 52.0
#> 7 brown light blue 47.0
#> 8 <NA> white, red red NA
#> 9 black light brown 24.0
#> 10 auburn, white fair blue-gray 57.0
#> # ... with 77 more rows
sw %>%
select(matches("or"))
#> # A tibble: 87 x 4
#> hair_color skin_color eye_color homeworld
#> <chr> <chr> <chr> <chr>
#> 1 blond fair blue Tatooine
#> 2 <NA> gold yellow Tatooine
#> 3 <NA> white, blue red Naboo
#> 4 none white yellow Tatooine
#> 5 brown light brown Alderaan
#> 6 brown, grey light blue Tatooine
#> 7 brown light blue Tatooine
#> 8 <NA> white, red red Tatooine
#> 9 black light brown Tatooine
#> 10 auburn, white fair blue-gray Stewjon
#> # ... with 77 more rows
# Renaming variables:
sw %>%
rename(creature = name, from_planet = homeworld)
#> # A tibble: 87 x 13
#> creature height mass hair_color skin_color eye_color
#> <chr> <int> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 blond fair blue
#> 2 C-3PO 167 75 <NA> gold yellow
#> 3 R2-D2 96 32 <NA> white, blue red
#> 4 Darth Vader 202 136 none white yellow
#> 5 Leia Organa 150 49 brown light brown
#> 6 Owen Lars 178 120 brown, grey light blue
#> 7 Beru Whitesun lars 165 75 brown light blue
#> 8 R5-D4 97 32 <NA> white, red red
#> 9 Biggs Darklighter 183 84 black light brown
#> 10 Obi-Wan Kenobi 182 77 auburn, white fair blue-gray
#> # ... with 77 more rows, and 7 more variables: birth_year <dbl>,
#> # gender <chr>, from_planet <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
## Note: See
# ?dplyr::select # for more help and examples.
# ?dplyr::select_if # for more help and examples.
Note some details:
- select works both by specifying variable (column) names and by specifying column numbers.
- Variable names are unquoted.
- The sequence of variable names (separated by commas) specifies the order of columns in the resulting tibble.
- Selecting and adding everything() allows re-ordering.
- Various helper functions (e.g., starts_with, ends_with, contains, matches, num_range) refer to (parts of) variable names.
- rename renames specified variables (without quotes) and keeps all other variables.
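The helper num_range is listed among the helper functions but does not appear in the examples, and select_if is only referenced via its help page. The following is a minimal sketch of both, assuming a small hypothetical tibble tb (tb and its column names are made up for illustration and are not part of the course data):

```r
library(dplyr)
library(tibble)

# Hypothetical toy data (not part of the course materials):
tb <- tibble(id   = 1:3,
             x1   = c(2, 4, 6),
             x2   = c(1, 3, 5),
             x3   = c(0, 0, 1),
             note = c("a", "b", "c"))

# num_range() selects columns by a common prefix and a numeric range:
tb_x <- tb %>% select(num_range("x", 1:2))
names(tb_x)

# select_if() selects columns that satisfy a condition (here: character variables):
tb_chr <- tb %>% select_if(is.character)
names(tb_chr)
```

Selecting by type in this way is one possible route to the "all character variables" part of the practice task.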
Practice: Use select on sw to select and re-order specific subsets of variables (e.g., all variables starting with “h”, all even columns, all character variables, etc.).
4. mutate to compute new variables
Using mutate computes new variables (columns) from scratch or existing ones:
# Preparation: Save only a subset of the variables of sw as sws:
sws <- select(sw, name:mass, birth_year:species)
sws # => 87 cases (rows), but only 7 variables (columns)
#> # A tibble: 87 x 7
#> name height mass birth_year gender homeworld species
#> <chr> <int> <dbl> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 19.0 male Tatooine Human
#> 2 C-3PO 167 75 112.0 <NA> Tatooine Droid
#> 3 R2-D2 96 32 33.0 <NA> Naboo Droid
#> 4 Darth Vader 202 136 41.9 male Tatooine Human
#> 5 Leia Organa 150 49 19.0 female Alderaan Human
#> 6 Owen Lars 178 120 52.0 male Tatooine Human
#> 7 Beru Whitesun lars 165 75 47.0 female Tatooine Human
#> 8 R5-D4 97 32 NA <NA> Tatooine Droid
#> 9 Biggs Darklighter 183 84 24.0 male Tatooine Human
#> 10 Obi-Wan Kenobi 182 77 57.0 male Stewjon Human
#> # ... with 77 more rows
# Compute 2 new variables and add them to existing ones:
mutate(sws, id = 1:nrow(sws), height_feet = .032808399 * height)
#> # A tibble: 87 x 9
#> name height mass birth_year gender homeworld species
#> <chr> <int> <dbl> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 19.0 male Tatooine Human
#> 2 C-3PO 167 75 112.0 <NA> Tatooine Droid
#> 3 R2-D2 96 32 33.0 <NA> Naboo Droid
#> 4 Darth Vader 202 136 41.9 male Tatooine Human
#> 5 Leia Organa 150 49 19.0 female Alderaan Human
#> 6 Owen Lars 178 120 52.0 male Tatooine Human
#> 7 Beru Whitesun lars 165 75 47.0 female Tatooine Human
#> 8 R5-D4 97 32 NA <NA> Tatooine Droid
#> 9 Biggs Darklighter 183 84 24.0 male Tatooine Human
#> 10 Obi-Wan Kenobi 182 77 57.0 male Stewjon Human
#> # ... with 77 more rows, and 2 more variables: id <int>, height_feet <dbl>
# The same using the pipe:
sws %>%
mutate(id = 1:nrow(sws), height_feet = .032808399 * height)
#> # A tibble: 87 x 9
#> name height mass birth_year gender homeworld species
#> <chr> <int> <dbl> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 19.0 male Tatooine Human
#> 2 C-3PO 167 75 112.0 <NA> Tatooine Droid
#> 3 R2-D2 96 32 33.0 <NA> Naboo Droid
#> 4 Darth Vader 202 136 41.9 male Tatooine Human
#> 5 Leia Organa 150 49 19.0 female Alderaan Human
#> 6 Owen Lars 178 120 52.0 male Tatooine Human
#> 7 Beru Whitesun lars 165 75 47.0 female Tatooine Human
#> 8 R5-D4 97 32 NA <NA> Tatooine Droid
#> 9 Biggs Darklighter 183 84 24.0 male Tatooine Human
#> 10 Obi-Wan Kenobi 182 77 57.0 male Stewjon Human
#> # ... with 77 more rows, and 2 more variables: id <int>, height_feet <dbl>
# Transmute computes and only keeps new variables:
sws %>%
transmute(id = 1:nrow(sws), height_feet = .032808399 * height)
#> # A tibble: 87 x 2
#> id height_feet
#> <int> <dbl>
#> 1 1 5.643045
#> 2 2 5.479003
#> 3 3 3.149606
#> 4 4 6.627297
#> 5 5 4.921260
#> 6 6 5.839895
#> 7 7 5.413386
#> 8 8 3.182415
#> 9 9 6.003937
#> 10 10 5.971129
#> # ... with 77 more rows
# Compute variables based on multiple others (including computed ones):
sws %>%
mutate(BMI = mass / ((height / 100) ^ 2), # compute body mass index (kg/m^2)
BMI_low = BMI < 18.5, # classify low BMI values
BMI_high = BMI > 30, # classify high BMI values
BMI_norm = !BMI_low & !BMI_high # classify normal BMI values
)
#> # A tibble: 87 x 11
#> name height mass birth_year gender homeworld species
#> <chr> <int> <dbl> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 19.0 male Tatooine Human
#> 2 C-3PO 167 75 112.0 <NA> Tatooine Droid
#> 3 R2-D2 96 32 33.0 <NA> Naboo Droid
#> 4 Darth Vader 202 136 41.9 male Tatooine Human
#> 5 Leia Organa 150 49 19.0 female Alderaan Human
#> 6 Owen Lars 178 120 52.0 male Tatooine Human
#> 7 Beru Whitesun lars 165 75 47.0 female Tatooine Human
#> 8 R5-D4 97 32 NA <NA> Tatooine Droid
#> 9 Biggs Darklighter 183 84 24.0 male Tatooine Human
#> 10 Obi-Wan Kenobi 182 77 57.0 male Stewjon Human
#> # ... with 77 more rows, and 4 more variables: BMI <dbl>, BMI_low <lgl>,
#> # BMI_high <lgl>, BMI_norm <lgl>
## Note: See
# ?dplyr::mutate # for more help and examples.
Note some details:
- mutate computes new variables (columns) and adds them to existing ones, while transmute drops existing ones.
- Each mutate command specifies a new variable name (without quotes), followed by = and a rule for computing the new variable from existing ones.
- Variable names are unquoted.
- Multiple mutate steps are separated by commas, each of which creates a new variable.
- See http://r4ds.had.co.nz/transform.html#mutate-funs for useful functions for creating new variables.
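The three logical BMI variables computed above can also be folded into a single categorical variable. The following is a minimal sketch, assuming dplyr's case_when() and a small hypothetical tibble (the names toy and BMI_class are made up for illustration; the cut-offs 18.5 and 30 are those used in the example above):

```r
library(dplyr)
library(tibble)

# Hypothetical toy data (height in cm, mass in kg):
toy <- tibble(name   = c("A", "B", "C", "D"),
              height = c(180, 170, 160, NA),
              mass   = c(50, 70, 90, 80))

toy_bmi <- toy %>%
  mutate(BMI = mass / ((height / 100) ^ 2),  # body mass index (kg/m^2)
         BMI_class = case_when(              # the first matching condition wins
           is.na(BMI) ~ NA_character_,       # keep missing values missing
           BMI < 18.5 ~ "low",
           BMI > 30   ~ "high",
           TRUE       ~ "normal"))
toy_bmi
```

As in the example above, BMI_class can use BMI within the same mutate call, because later steps may refer to variables created in earlier ones.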
Practice: Compute a new variable mass_pound from mass (in kg) and the age of each individual in sw relative to Yoda’s age. (Note that the variable birth_year is provided in years BBY, i.e., Before Battle of Yavin.)
5. summarise to compute summaries
summarise computes a function for a specified variable and collapses the values of the specified variable (i.e., the rows of a specified column) to a single value. It provides many different summary statistics by itself, but is even more useful in combination with group_by (discussed next).
# Summarise allows computing a function for a variable (column):
summarise(sw, mn_mass = mean(mass, na.rm = TRUE)) # => 97.31 kg
#> # A tibble: 1 x 1
#> mn_mass
#> <dbl>
#> 1 97.31186
# The same using the pipe:
sw %>%
summarise(mn_mass = mean(mass, na.rm = TRUE)) # => 97.31 kg
#> # A tibble: 1 x 1
#> mn_mass
#> <dbl>
#> 1 97.31186
# Multiple summarise steps allow applying
# different functions for 1 dependent variable:
sw %>%
summarise(n_mass = sum(!is.na(mass)),
mn_mass = mean(mass, na.rm = TRUE),
md_mass = median(mass, na.rm = TRUE),
sd_mass = sd(mass, na.rm = TRUE),
max_mass = max(mass, na.rm = TRUE),
big_mass = any(mass > 1000)
)
#> # A tibble: 1 x 6
#> n_mass mn_mass md_mass sd_mass max_mass big_mass
#> <int> <dbl> <dbl> <dbl> <dbl> <lgl>
#> 1 59 97.31186 79 169.4572 1358 TRUE
# Multiple summarise steps also allow applying
# different functions to different dependent variables:
sw %>%
summarise(# Descriptives of height:
n_height = sum(!is.na(height)),
mn_height = mean(height, na.rm = TRUE),
sd_height = sd(height, na.rm = TRUE),
# Descriptives of mass:
n_mass = sum(!is.na(mass)),
mn_mass = mean(mass, na.rm = TRUE),
sd_mass = sd(mass, na.rm = TRUE),
# Counts of character variables:
n_names = n(),
n_species = n_distinct(species),
n_worlds = n_distinct(homeworld)
)
#> # A tibble: 1 x 9
#> n_height mn_height sd_height n_mass mn_mass sd_mass n_names n_species
#> <int> <dbl> <dbl> <int> <dbl> <dbl> <int> <int>
#> 1 81 174.358 34.77043 59 97.31186 169.4572 87 38
#> # ... with 1 more variables: n_worlds <int>
## Note: See
# ?dplyr::summarise # for more help and examples.
Note some details:
- summarise collapses multiple values into one value and returns a new tibble with one row (or one row per group when combined with group_by) and one column for each summary computed.
- Each summarise step specifies a new variable name (without quotes), followed by =, and a function for computing the new variable from existing ones.
- Multiple summarise steps are separated by commas.
- Variable names are unquoted.
- See https://dplyr.tidyverse.org/reference/summarise.html for examples and useful functions in combination with summarise.
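The summaries above all set na.rm = TRUE. The following is a minimal sketch of why this matters, assuming a hypothetical tibble toy with one missing value (the data and variable names are made up for illustration):

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with 1 missing value:
toy <- tibble(mass = c(60, 80, NA, 100))

res <- toy %>%
  summarise(n_all    = n(),                       # counts all rows, including NA
            n_valid  = sum(!is.na(mass)),         # counts non-missing values
            mn_naive = mean(mass),                # NA, as one value is missing
            mn_mass  = mean(mass, na.rm = TRUE))  # mean of the 3 known values
res
```

Reporting the number of valid values next to each mean (as n_mass and n_height do above) makes it clear how many cases a summary is actually based on.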
Practice: Apply all summary functions mentioned in ?dplyr::summarise to the sw dataset.
6. group_by to aggregate variables
Using group_by does not change the data, but the unit of aggregation for other commands, which is very useful in combination with mutate and summarise.
# Grouping does not change the data, but lists its groups:
group_by(sws, species) # => 38 groups of species
#> # A tibble: 87 x 7
#> # Groups: species [38]
#> name height mass birth_year gender homeworld species
#> <chr> <int> <dbl> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 19.0 male Tatooine Human
#> 2 C-3PO 167 75 112.0 <NA> Tatooine Droid
#> 3 R2-D2 96 32 33.0 <NA> Naboo Droid
#> 4 Darth Vader 202 136 41.9 male Tatooine Human
#> 5 Leia Organa 150 49 19.0 female Alderaan Human
#> 6 Owen Lars 178 120 52.0 male Tatooine Human
#> 7 Beru Whitesun lars 165 75 47.0 female Tatooine Human
#> 8 R5-D4 97 32 NA <NA> Tatooine Droid
#> 9 Biggs Darklighter 183 84 24.0 male Tatooine Human
#> 10 Obi-Wan Kenobi 182 77 57.0 male Stewjon Human
#> # ... with 77 more rows
# The same using the pipe:
sws %>%
group_by(species) # => 38 groups of species
#> # A tibble: 87 x 7
#> # Groups: species [38]
#> name height mass birth_year gender homeworld species
#> <chr> <int> <dbl> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 19.0 male Tatooine Human
#> 2 C-3PO 167 75 112.0 <NA> Tatooine Droid
#> 3 R2-D2 96 32 33.0 <NA> Naboo Droid
#> 4 Darth Vader 202 136 41.9 male Tatooine Human
#> 5 Leia Organa 150 49 19.0 female Alderaan Human
#> 6 Owen Lars 178 120 52.0 male Tatooine Human
#> 7 Beru Whitesun lars 165 75 47.0 female Tatooine Human
#> 8 R5-D4 97 32 NA <NA> Tatooine Droid
#> 9 Biggs Darklighter 183 84 24.0 male Tatooine Human
#> 10 Obi-Wan Kenobi 182 77 57.0 male Stewjon Human
#> # ... with 77 more rows
# group_by is ineffective by itself, but very powerful
# (a) in combination with `mutate` and
# (b) in combination with `summarise`.
# ad (a):
# In combination with mutate and an aggregation function,
# group_by changes the unit of aggregation:
sws %>%
mutate(mn_height_1 = mean(height, na.rm = TRUE)) %>% # aggregates over ALL cases
group_by(species) %>%
mutate(mn_height_2 = mean(height, na.rm = TRUE)) %>% # aggregates over current group (species)
group_by(gender) %>%
mutate(mn_height_3 = mean(height, na.rm = TRUE)) %>% # aggregates over current group (gender)
group_by(name) %>%
mutate(mn_height_4 = mean(height, na.rm = TRUE)) # aggregates over current group (name)
#> # A tibble: 87 x 11
#> # Groups: name [87]
#> name height mass birth_year gender homeworld species
#> <chr> <int> <dbl> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 19.0 male Tatooine Human
#> 2 C-3PO 167 75 112.0 <NA> Tatooine Droid
#> 3 R2-D2 96 32 33.0 <NA> Naboo Droid
#> 4 Darth Vader 202 136 41.9 male Tatooine Human
#> 5 Leia Organa 150 49 19.0 female Alderaan Human
#> 6 Owen Lars 178 120 52.0 male Tatooine Human
#> 7 Beru Whitesun lars 165 75 47.0 female Tatooine Human
#> 8 R5-D4 97 32 NA <NA> Tatooine Droid
#> 9 Biggs Darklighter 183 84 24.0 male Tatooine Human
#> 10 Obi-Wan Kenobi 182 77 57.0 male Stewjon Human
#> # ... with 77 more rows, and 4 more variables: mn_height_1 <dbl>,
#> # mn_height_2 <dbl>, mn_height_3 <dbl>, mn_height_4 <dbl>
# ad (b):
# group_by is particularly useful in combination
# with summarise:
sws %>%
group_by(homeworld) %>%
summarise(count = n(),
mn_height = mean(height, na.rm = TRUE),
mn_mass = mean(mass, na.rm = TRUE)
)
#> # A tibble: 49 x 4
#> homeworld count mn_height mn_mass
#> <chr> <int> <dbl> <dbl>
#> 1 Alderaan 3 176.3333 64.0
#> 2 Aleen Minor 1 79.0000 15.0
#> 3 Bespin 1 175.0000 79.0
#> 4 Bestine IV 1 180.0000 110.0
#> 5 Cato Neimoidia 1 191.0000 90.0
#> 6 Cerea 1 198.0000 82.0
#> 7 Champala 1 196.0000 NaN
#> 8 Chandrila 1 150.0000 NaN
#> 9 Concord Dawn 1 183.0000 79.0
#> 10 Corellia 2 175.0000 78.5
#> # ... with 39 more rows
# Note that this pipe returns a new tibble,
# with 49 rows (= different levels of homeworld) and
# - 1 column of the group variable (homeworld) and
# - 3 columns of the 3 newly summarised variables.
# group_by used with multiple variables yields a tibble
# containing the combination of all variable levels:
sw %>%
group_by(hair_color, eye_color) # => 35 groups (combinations)
#> # A tibble: 87 x 13
#> # Groups: hair_color, eye_color [35]
#> name height mass hair_color skin_color eye_color
#> <chr> <int> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 blond fair blue
#> 2 C-3PO 167 75 <NA> gold yellow
#> 3 R2-D2 96 32 <NA> white, blue red
#> 4 Darth Vader 202 136 none white yellow
#> 5 Leia Organa 150 49 brown light brown
#> 6 Owen Lars 178 120 brown, grey light blue
#> 7 Beru Whitesun lars 165 75 brown light blue
#> 8 R5-D4 97 32 <NA> white, red red
#> 9 Biggs Darklighter 183 84 black light brown
#> 10 Obi-Wan Kenobi 182 77 auburn, white fair blue-gray
#> # ... with 77 more rows, and 7 more variables: birth_year <dbl>,
#> # gender <chr>, homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
# Counting the frequency of cases in groups:
sw %>%
group_by(hair_color, eye_color) %>%
count() %>%
arrange(desc(n))
#> # A tibble: 35 x 3
#> # Groups: hair_color, eye_color [35]
#> hair_color eye_color n
#> <chr> <chr> <int>
#> 1 black brown 9
#> 2 brown brown 9
#> 3 none black 9
#> 4 brown blue 7
#> 5 none orange 7
#> 6 none yellow 6
#> 7 blond blue 3
#> 8 none blue 3
#> 9 none red 3
#> 10 black blue 2
#> # ... with 25 more rows
# The same using summarise:
sw %>%
group_by(hair_color, eye_color) %>%
summarise(n = n()) %>%
arrange(desc(n))
#> # A tibble: 35 x 3
#> # Groups: hair_color [13]
#> hair_color eye_color n
#> <chr> <chr> <int>
#> 1 black brown 9
#> 2 brown brown 9
#> 3 none black 9
#> 4 brown blue 7
#> 5 none orange 7
#> 6 none yellow 6
#> 7 blond blue 3
#> 8 none blue 3
#> 9 none red 3
#> 10 black blue 2
#> # ... with 25 more rows
## Note: See
# ?dplyr::group_by # for more help and examples.
Note some details:
- group_by changes the unit of aggregation for other commands (mutate and summarise).
- Variable names are unquoted.
- When using group_by with multiple variables, they are separated by commas.
- Using group_by with mutate results in a tibble that has the same number of cases (rows) as the original tibble. By contrast, using group_by with summarise results in a new tibble with all combinations of variable levels as its cases (rows).
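The difference between grouped mutate and grouped summarise can be sketched on a small hypothetical tibble (toy and its values are made up for illustration). The sketch also uses ungroup(), which is not covered above, to remove a grouping once it is no longer needed:

```r
library(dplyr)
library(tibble)

# Hypothetical toy data:
toy <- tibble(group = c("a", "a", "b", "b", "b"),
              value = c(1, 3, 2, 4, 6))

# Grouped mutate: 1 row per original case, group mean repeated within groups:
toy_mut <- toy %>%
  group_by(group) %>%
  mutate(mn_value = mean(value)) %>%
  ungroup()   # remove the grouping again

# Grouped summarise: 1 row per group:
toy_sum <- toy %>%
  group_by(group) %>%
  summarise(n = n(), mn_value = mean(value))

dim(toy_mut)  # 5 rows (cases) x 3 columns
dim(toy_sum)  # 2 rows (groups) x 3 columns
```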
Practice: Create some groups and compute descriptive statistics (n, mean, median, standard deviation, …) for some variables. For instance,
What is the number and mean height and mass of individuals from Tatooine by species and gender?
Which humans are more than 5cm taller than the average human overall?
Which humans are more than 5cm taller than the average human of their own gender?
Combining commands
The essential dplyr commands are quite simple by themselves, but form the basic verbs of a language for data manipulation. The commands become particularly powerful when they are combined into pipes (by using %>%). Stringing together several dplyr commands allows slicing and dicing data (tibbles or data frames) in a step-wise fashion to run non-trivial data analyses on the fly.
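As a sketch of such a pipe, the following combines filter, mutate, group_by, summarise, and arrange in one sequence. The variable names height_m, mn_height_m, and tall_worlds are made up for illustration, and the exact result may depend on the installed version of dplyr and its starwars data:

```r
library(dplyr)

sw <- dplyr::starwars

# Filter cases, compute a new variable, aggregate by group, and sort the result:
tall_worlds <- sw %>%
  filter(!is.na(height), !is.na(homeworld)) %>%  # keep cases with known values
  mutate(height_m = height / 100) %>%            # height in meters
  group_by(homeworld) %>%                        # change the unit of aggregation
  summarise(n = n(),
            mn_height_m = mean(height_m)) %>%    # 1 row per homeworld
  filter(n > 1) %>%                              # only worlds with 2+ individuals
  arrange(desc(mn_height_m))                     # tallest first
tall_worlds
```

Each step receives the tibble produced by the previous step, so the pipe can be read from top to bottom as a sequence of small transformations.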
Practice: Tidyverse meets universe
Answer the following questions about the dplyr::starwars dataset by using pipes of essential dplyr commands:
a. Basics:
- Save the tibble dplyr::starwars as sw and report its dimensions.
b. Missing values and known unknowns:
- How many missing (NA) values does sw contain?
- Which individuals come from an unknown (missing) homeworld but have a known birth_year or known mass?
c. Gender issues:
- How many humans are contained in sw overall and by gender?
- How many and which individuals in sw are neither male nor female?
- Of which species in sw exist at least 2 different gender values?
d. Popular homes and heights:
- From which homeworld do the most individuals (rows) come?
- What is the mean height of all individuals with orange eyes from the most popular homeworld?
e. Size and mass issues:
- Compute the median, mean, and standard deviation of height for all droids.
- Compute the average height and mass by species and save the result as h_m.
- Sort h_m to list the 3 species with the smallest individuals (in terms of mean height).
- Sort h_m to list the 3 species with the heaviest individuals (in terms of median mass).
f. Counting and arranging:
- How many individuals exist of the three most frequent (known) species?
g. Grouped mutates:
- Which individuals are more than 20% lighter than the average mass of individuals of their own homeworld?
# library(tidyverse)
# ?dplyr::starwars
## (a) Basic data properties: ----
sw <- dplyr::starwars
dim(sw) # => 87 rows (denoting individuals) x 13 columns (variables)
#> [1] 87 13
## (b) Missing data: -----
## (+) How many missing data points?
sum(is.na(sw)) # => 101 missing values.
#> [1] 101
# (+) Which individuals come from an unknown (missing) homeworld
# but have a known birth_year or mass?
sw %>%
filter(is.na(homeworld), !is.na(mass) | !is.na(birth_year))
#> # A tibble: 3 x 13
#> name height mass hair_color skin_color eye_color birth_year
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl>
#> 1 Yoda 66 17 white green brown 896
#> 2 IG-88 200 140 none metal red 15
#> 3 Qui-Gon Jinn 193 89 brown fair blue 92
#> # ... with 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#> # films <list>, vehicles <list>, starships <list>
## (x) Which variable (column) has the most missing values?
colSums(is.na(sw)) # => birth_year has 44 missing values
#> name height mass hair_color skin_color eye_color
#> 0 6 28 5 0 0
#> birth_year gender homeworld species films vehicles
#> 44 3 10 5 0 0
#> starships
#> 0
colMeans(is.na(sw)) # (amounting to 50.6% of all cases).
#> name height mass hair_color skin_color eye_color
#> 0.00000000 0.06896552 0.32183908 0.05747126 0.00000000 0.00000000
#> birth_year gender homeworld species films vehicles
#> 0.50574713 0.03448276 0.11494253 0.05747126 0.00000000 0.00000000
#> starships
#> 0.00000000
## (x) Replace all missing values of `hair_color` (in the variable `sw$hair_color`) by "bald":
# sw$hair_color[is.na(sw$hair_color)] <- "bald"
## (c) Gender issues: -----
# (+) How many humans are there of each gender?
sw %>%
filter(species == "Human") %>%
group_by(gender) %>%
count()
#> # A tibble: 2 x 2
#> # Groups: gender [2]
#> gender n
#> <chr> <int>
#> 1 female 9
#> 2 male 26
## Answer: 35 humans in total: 9 female, 26 male.
# (+) How many and which individuals are neither male nor female?
# (Note: Comparisons with NA yield NA, so filter() also drops cases with a missing gender.)
sw %>%
filter(gender != "male", gender != "female")
#> # A tibble: 3 x 13
#> name height mass hair_color skin_color eye_color
#> <chr> <int> <dbl> <chr> <chr> <chr>
#> 1 Jabba Desilijic Tiure 175 1358 <NA> green-tan, brown orange
#> 2 IG-88 200 140 none metal red
#> 3 BB8 NA NA none none black
#> # ... with 7 more variables: birth_year <dbl>, gender <chr>,
#> # homeworld <chr>, species <chr>, films <list>, vehicles <list>,
#> # starships <list>
# (+) Of which species are there at least 2 different gender values?
sw %>%
group_by(species, gender) %>%
count() %>% # table shows species by gender:
group_by(species) %>% # Which species appear more than once in this table?
count() %>%
filter(nn > 1)
#> # A tibble: 5 x 2
#> # Groups: species [5]
#> species nn
#> <chr> <int>
#> 1 Droid 2
#> 2 Human 2
#> 3 Kaminoan 2
#> 4 Twi'lek 2
#> 5 <NA> 2
## (d) Homeworld issues: -----
# (+) Popular homes: From which homeworld do the most individuals (rows) come?
sw %>%
group_by(homeworld) %>%
count() %>%
arrange(desc(n))
#> # A tibble: 49 x 2
#> # Groups: homeworld [49]
#> homeworld n
#> <chr> <int>
#> 1 Naboo 11
#> 2 Tatooine 10
#> 3 <NA> 10
#> 4 Alderaan 3
#> 5 Coruscant 3
#> 6 Kamino 3
#> 7 Corellia 2
#> 8 Kashyyyk 2
#> 9 Mirial 2
#> 10 Ryloth 2
#> # ... with 39 more rows
# => Naboo (with 11 individuals)
# (+) What is the mean height of all individuals with orange eyes from the most popular homeworld?
sw %>%
filter(homeworld == "Naboo", eye_color == "orange") %>%
summarise(n = n(),
mn_height = mean(height))
#> # A tibble: 1 x 2
#> n mn_height
#> <int> <dbl>
#> 1 3 208.6667
## Note:
sw %>% filter(eye_color == "orange") # => 8 individuals
#> # A tibble: 8 x 13
#> name height mass hair_color skin_color
#> <chr> <int> <dbl> <chr> <chr>
#> 1 Jabba Desilijic Tiure 175 1358 <NA> green-tan, brown
#> 2 Ackbar 180 83 none brown mottle
#> 3 Jar Jar Binks 196 66 none orange
#> 4 Roos Tarpals 224 82 none grey
#> 5 Rugor Nass 206 NA none green
#> 6 Sebulba 112 40 none grey, red
#> 7 Ben Quadinaros 163 65 none grey, green, yellow
#> 8 Saesee Tiin 188 NA none pale
#> # ... with 8 more variables: eye_color <chr>, birth_year <dbl>,
#> # gender <chr>, homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
# (+) What is the mass and homeworld of the smallest droid?
sw %>%
filter(species == "Droid") %>%
arrange(height)
#> # A tibble: 5 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 R2-D2 96 32 <NA> white, blue red 33 <NA>
#> 2 R5-D4 97 32 <NA> white, red red NA <NA>
#> 3 C-3PO 167 75 <NA> gold yellow 112 <NA>
#> 4 IG-88 200 140 none metal red 15 none
#> 5 BB8 NA NA none none black NA none
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
## (e) Size and mass: Group summaries: -----
# (+) Compute the median, mean, and standard deviation of `height` for all droids.
sw %>%
filter(species == "Droid") %>%
summarise(n = n(),
not_NA_h = sum(!is.na(height)),
md_height = median(height, na.rm = TRUE),
mn_height = mean(height, na.rm = TRUE),
sd_height = sd(height, na.rm = TRUE))
#> # A tibble: 1 x 5
#> n not_NA_h md_height mn_height sd_height
#> <int> <int> <dbl> <dbl> <dbl>
#> 1 5 4 132 140 52.00641
# (+) Compute the average height and mass by species and save the result as `h_m`:
h_m <- sw %>%
group_by(species) %>%
summarise(n = n(),
not_NA_h = sum(!is.na(height)),
mn_height = mean(height, na.rm = TRUE),
not_NA_m = sum(!is.na(mass)),
md_mass = median(mass, na.rm = TRUE)
)
h_m
#> # A tibble: 38 x 6
#> species n not_NA_h mn_height not_NA_m md_mass
#> <chr> <int> <int> <dbl> <int> <dbl>
#> 1 Aleena 1 1 79.0000 1 15.0
#> 2 Besalisk 1 1 198.0000 1 102.0
#> 3 Cerean 1 1 198.0000 1 82.0
#> 4 Chagrian 1 1 196.0000 0 NA
#> 5 Clawdite 1 1 168.0000 1 55.0
#> 6 Droid 5 4 140.0000 4 53.5
#> 7 Dug 1 1 112.0000 1 40.0
#> 8 Ewok 1 1 88.0000 1 20.0
#> 9 Geonosian 1 1 183.0000 1 80.0
#> 10 Gungan 3 3 208.6667 2 74.0
#> # ... with 28 more rows
# (+) Use `h_m` to list the 3 species with the smallest individuals (in terms of mean height)?
h_m %>% arrange(mn_height) %>% slice(1:3)
#> # A tibble: 3 x 6
#> species n not_NA_h mn_height not_NA_m md_mass
#> <chr> <int> <int> <dbl> <int> <dbl>
#> 1 Yoda's species 1 1 66 1 17
#> 2 Aleena 1 1 79 1 15
#> 3 Ewok 1 1 88 1 20
# (+) Use `h_m` to list the 3 species with the heaviest individuals (in terms of median mass)?
h_m %>% arrange(desc(md_mass)) %>% slice(1:3)
#> # A tibble: 3 x 6
#> species n not_NA_h mn_height not_NA_m md_mass
#> <chr> <int> <int> <dbl> <int> <dbl>
#> 1 Hutt 1 1 175 1 1358
#> 2 Kaleesh 1 1 216 1 159
#> 3 Wookiee 2 2 231 2 124
## (+) Other questions: -----
# (f) How many individuals come from the 3 most frequent (known) species?
sw %>%
group_by(species) %>%
count() %>%
arrange(desc(n)) %>%
filter(n > 1)
#> # A tibble: 9 x 2
#> # Groups: species [9]
#> species n
#> <chr> <int>
#> 1 Human 35
#> 2 Droid 5
#> 3 <NA> 5
#> 4 Gungan 3
#> 5 Kaminoan 2
#> 6 Mirialan 2
#> 7 Twi'lek 2
#> 8 Wookiee 2
#> 9 Zabrak 2
# (g) Which individuals are more than 20% lighter (in terms of mass)
# than the average mass of individuals of their own homeworld?
sw %>%
select(name, homeworld, mass) %>%
group_by(homeworld) %>%
mutate(n_notNA_mass = sum(!is.na(mass)),
mn_mass = mean(mass, na.rm = TRUE),
lighter = mass < (mn_mass - (.20 * mn_mass))
) %>%
filter(lighter == TRUE)
#> # A tibble: 5 x 6
#> # Groups: homeworld [4]
#> name homeworld mass n_notNA_mass mn_mass lighter
#> <chr> <chr> <dbl> <int> <dbl> <lgl>
#> 1 R2-D2 Naboo 32 6 64.16667 TRUE
#> 2 Leia Organa Alderaan 49 2 64.00000 TRUE
#> 3 R5-D4 Tatooine 32 8 85.37500 TRUE
#> 4 Yoda <NA> 17 3 82.00000 TRUE
#> 5 Padmé Amidala Naboo 45 6 64.16667 TRUE
More on data transformation
For more details on dplyr,
- study vignette("dplyr") and the documentation for ?arrange, ?filter, ?select, etc.;
- study https://dplyr.tidyverse.org/ and its examples;
- see the cheat sheet on data transformation;
- read Chapter 5: Data transformation and complete its exercises.
Visualizing data
Creating good graphs is both an art and a craft. A transparent visualization of data can promote insights before and beyond any mathematical analysis or statistical test. However, creating good graphs requires a thorough understanding of the data, the visual properties of graphs, and the tools that allow turning data into graphs. One such tool is the package ggplot2, which implements a so-called “grammar of graphics” for R.
In the following, we introduce some essential commands of ggplot2 in the context of examples. However, the ggplot2 package extends far beyond this modest introduction – it is an important pillar (and predecessor) of the tidyverse and implements a language for and philosophy of data visualisation.
See Chapter 3: Data visualization and Chapter 7: Exploratory data analysis (EDA) and the links provided below for more detailed information.
Commands and examples
General structure of ggplot calls
A generic template for creating a graph with ggplot is:
# Generic ggplot template:
ggplot(data = <DATA>) +
<GEOM_fun>(mapping = aes(<MAPPING>), <arg_1 = val_1, ..., arg_n = val_n>) +
<FACET_fun> + # optional
<LOOK_GOOD_fun> # optional
# Minimal ggplot template:
ggplot(<DATA>) +
<GEOM_fun>(aes(<MAPPING>))
The generic template includes the following parts:
- <DATA> is a data frame or tibble that contains the data that is to be plotted.
- <GEOM_fun> is a function that maps data to a geometric object (“geom”) according to an aesthetic mapping that is specified in aes(<MAPPING>). (A “mapping” specifies what goes where.)
- A geom’s visual appearance (e.g., colors, shapes, sizes, …) can be customized
  - in the aesthetic mapping (when varying visual features according to data properties), or
  - by setting its arguments to specific values in <arg_1 = val_1, ..., arg_n = val_n> (when remaining constant).
- An optional <FACET_fun> splits a complex plot into multiple subplots.
- A sequence of optional <LOOK_GOOD_fun> adjusts the visual features of plots (e.g., by adding themes, plot titles and labels, color scales, and coordinate systems).
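To make the template concrete, the following sketch fills each slot with one possible choice, using the ggplot2::mpg data. The selected variables, labels, and the name p are illustrative assumptions, not a prescribed solution:

```r
library(ggplot2)

p <- ggplot(data = mpg) +                                 # <DATA>
  geom_point(mapping = aes(x = displ, y = hwy,            # <GEOM_fun> with <MAPPING>:
                           color = drv),                  #   color varies with data (inside aes)
             size = 2, alpha = 1/2) +                     #   constant arguments (outside aes)
  facet_wrap(~year) +                                     # <FACET_fun>
  labs(title = "Highway mileage by engine displacement",  # <LOOK_GOOD_fun>s
       x = "Engine displacement (liters)",
       y = "Highway (miles per gallon)") +
  theme_bw()
p
```

Note the difference between color = drv (inside aes, so the color depends on the data) and size = 2 (outside aes, so all points get the same size).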
Some examples that illustrate the use of these components are:
Histograms
A histogram counts how often specific values of one (typically continuous) variable occur in the data. This allows viewing the distribution of values for this variable:
library(ggplot2)
# Data: ------
# Using mpg data:
?ggplot2::mpg
mpg
#> # A tibble: 234 x 11
#> manufacturer model displ year cyl trans drv cty hwy fl
#> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr>
#> 1 audi a4 1.80 1999 4 auto(l… f 18 29 p
#> 2 audi a4 1.80 1999 4 manual… f 21 29 p
#> 3 audi a4 2.00 2008 4 manual… f 20 31 p
#> 4 audi a4 2.00 2008 4 auto(a… f 21 30 p
#> 5 audi a4 2.80 1999 6 auto(l… f 16 26 p
#> 6 audi a4 2.80 1999 6 manual… f 18 26 p
#> 7 audi a4 3.10 2008 6 auto(a… f 18 27 p
#> 8 audi a4 quat… 1.80 1999 4 manual… 4 18 26 p
#> 9 audi a4 quat… 1.80 1999 4 auto(l… 4 16 25 p
#> 10 audi a4 quat… 2.00 2008 4 manual… 4 20 28 p
#> # ... with 224 more rows, and 1 more variable: class <chr>
# (A) Histogram: ------
# A minimal histogram:
hi1 <- ggplot(mpg, aes(x = cty)) + # set mappings for ALL geoms
geom_histogram(binwidth = 1)
hi1
# The same histogram:
hi1b <- ggplot(mpg) +
geom_histogram(aes(x = cty), binwidth = 1) # set mappings for THIS geom
hi1b
# (B) Adding aesthetics, labels and themes: ------
# Enhanced version of the same plot:
hi2 <- ggplot(mpg) +
geom_histogram(aes(x = cty), binwidth = 1, fill = "forestgreen", color = "black") +
labs(title = "Distribution of fuel economy in city environments",
x = "cty (miles per gallon)",
caption = "Data from ggplot2::mpg") +
theme_light()
hi2
Scatterplots
A scatterplot shows each data point (observation) at a position defined by 2 (typically continuous) variables x and y. This allows judging the relationship between x and y in the data:
# (A) Scatterplot: ------
# A minimal scatterplot + reference line:
sp1 <- ggplot(mpg) +
geom_point(aes(x = cty, y = hwy)) +
geom_abline()
sp1
Dealing with overplotting
A common issue with scatterplots is so-called overplotting: Multiple points appear at the same position.
Here are some ways of dealing with this issue:
- jitter adds randomness to positions;
- alpha uses transparency to show the frequency of positions;
- size (as an aesthetic, or via geom_count) allows mapping values (e.g., frequency) to object size;
- facet_wrap allows disentangling plots by levels of variables.
Some examples include:
## Dealing with overplotting: -----
# 1. One way of dealing with overplotting is
# adding randomness to point positions:
sp2 <- ggplot(mpg) +
geom_point(aes(x = cty, y = hwy), position = "jitter") +
geom_abline()
sp2
# 2. Another way of dealing with overplotting is
# using transparency (via setting alpha to < 1):
sp3 <- ggplot(mpg) +
geom_point(aes(x = cty, y = hwy), position = "identity",
pch = 21, fill = "steelblue", alpha = 1/4, size = 4) +
geom_abline(linetype = 2, color = "firebrick") # +
# geom_rug(aes(x = cty, y = hwy), position = "jitter", alpha = 1/4, size = 1)
sp3
# Adding labels and themes to plots:
sp4 <- sp3 + # use the plot defined above
labs(title = "Fuel economy on highway vs. city",
x = "City (miles per gallon)",
y = "Highway (miles per gallon)",
caption = "Data from ggplot2::mpg") +
# coord_fixed() +
theme_bw()
sp4
# (C) Grouping (by a categorical variable): ------
# Using facets to avoid overplotting:
sp5 <- ggplot(mpg) +
geom_point(aes(x = cty, y = hwy)) +
geom_abline() +
facet_wrap(~class) +
theme_bw()
sp5
# Grouping by color:
sp6 <- ggplot(mpg) +
geom_point(aes(x = cty, y = hwy, color = class),
position = "jitter", alpha = 1/2, size = 4) +
geom_abline(linetype = 2) +
theme_bw()
sp6
# Grouping by facets:
sp7 <- ggplot(mpg) +
geom_point(aes(x = cty, y = hwy),
position = "jitter", alpha = 1/2, size = 2) +
geom_abline(linetype = 2) +
facet_wrap(~class) +
theme_bw()
sp7
See https://ggplot2.tidyverse.org/reference/ for more examples.
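A further option for dealing with overplotting is mapping the frequency of each position to point size. A minimal sketch, assuming geom_count of ggplot2 (which counts the cases at each position); the object names sp_count and n_per_position are made up for illustration:

```r
library(ggplot2)

# Map the number of cases at each position to point size:
sp_count <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_count(color = "steelblue", alpha = 1/2) +
  geom_abline(linetype = 2)
sp_count

# The computed counts per position are stored in the variable n:
n_per_position <- layer_data(sp_count, 1)$n
sum(n_per_position)  # should equal the number of cases in mpg
```

Unlike jittering, this approach keeps all points at their exact positions, but requires reading a size legend.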
Note some details:
- ggplot requires data and maps independent variables to dimensions (e.g., the x- and y-axis) and dependent variables to geometric objects (called “geoms”). It typically assumes that the to-be-plotted <DATA> is in a table (data frame or tibble) in long format and contains independent variables as factors.
- The argument names data = and mapping = can be omitted, but an aesthetic mapping aes(<MAPPING>) for at least one geom is needed.
- Different geoms can be combined, but their order matters (as later layers are printed on top of earlier ones).
- When multiple geoms use the same mappings, their common aes(<MAPPING>) can be moved into the initial ggplot call (behind <DATA>).
- In ggplot, a sequence of commands is combined by +, rather than %>%.
- The visual appearance of plots is highly customizable (e.g., by supplying aesthetic arguments, specifying labels and legends, and applying pre-defined themes to plots).
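The point about common mappings can be illustrated by a minimal sketch. It combines geom_point with geom_smooth (a geom not used elsewhere in this section and chosen here only for illustration); the object names p_a, p_b, same_points, and same_lines are made up:

```r
library(ggplot2)

# (a) The same mapping is repeated in each geom:
p_a <- ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy)) +
  geom_smooth(aes(x = cty, y = hwy), method = "lm", formula = y ~ x)

# (b) The common mapping is moved into the initial ggplot() call:
p_b <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Both versions yield the same computed data for each layer:
same_points <- isTRUE(all.equal(layer_data(p_a, 1), layer_data(p_b, 1)))
same_lines  <- isTRUE(all.equal(layer_data(p_a, 2), layer_data(p_b, 2)))
c(same_points, same_lines)
```

As the points are added before the line, the line is drawn on top of the points. Swapping the 2 geoms would reverse this order.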
Bar plots
Another common type of plot shows the values (across different levels of some variable) as the height of bars. As this plot type can use both categorical and continuous variables, it turns out to be surprisingly complex to create good bar charts. To get us started, here are only a few examples:
Counts of cases
By default, geom_bar computes summary statistics of the data. When nothing else is specified, geom_bar counts the number or frequency of values (i.e., stat = "count") and maps this count to the y-axis (i.e., y = ..count..):
library(ggplot2)
## Data:
ggplot2::mpg
#> # A tibble: 234 x 11
#> manufacturer model displ year cyl trans drv cty hwy fl
#> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr>
#> 1 audi a4 1.80 1999 4 auto(l… f 18 29 p
#> 2 audi a4 1.80 1999 4 manual… f 21 29 p
#> 3 audi a4 2.00 2008 4 manual… f 20 31 p
#> 4 audi a4 2.00 2008 4 auto(a… f 21 30 p
#> 5 audi a4 2.80 1999 6 auto(l… f 16 26 p
#> 6 audi a4 2.80 1999 6 manual… f 18 26 p
#> 7 audi a4 3.10 2008 6 auto(a… f 18 27 p
#> 8 audi a4 quat… 1.80 1999 4 manual… 4 18 26 p
#> 9 audi a4 quat… 1.80 1999 4 auto(l… 4 16 25 p
#> 10 audi a4 quat… 2.00 2008 4 manual… 4 20 28 p
#> # ... with 224 more rows, and 1 more variable: class <chr>
# (a) Count number of cases by class:
ggplot(mpg) +
geom_bar(aes(x = class))
# (b) is the same as:
ggplot(mpg) +
geom_bar(aes(x = class, y = ..count..))
# (c) is the same as:
ggplot(mpg) +
geom_bar(aes(x = class), stat = "count")
# (d) is the same as:
ggplot(mpg) +
geom_bar(aes(x = class, y = ..count..), stat = "count")
# (e) pimped version:
ggplot(mpg) +
geom_bar(aes(x = class, fill = class),
# stat = "count",
color = "black") +
labs(title = "Counts of cars by class",
x = "Class of car", y = "Frequency") +
scale_fill_brewer(name = "Class:", palette = "Blues") +
theme_bw()
Practice: Plot the number or frequency of cases in the mpg data by cyl (in at least 3 different ways).
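In more recent versions of ggplot2, computed variables such as ..count.. can also be written with after_stat(). As far as we know, this helper was added in version 3.3.0, and the ..name.. notation has been deprecated since version 3.4.0. A minimal sketch of the newer spelling, assuming such a version is installed; the object names bar_old, bar_new, and counts are made up for illustration:

```r
library(ggplot2)

# Default: stat = "count" computes the number of cases per class:
bar_old <- ggplot(mpg) +
  geom_bar(aes(x = class))

# Newer spelling of y = ..count..:
bar_new <- ggplot(mpg) +
  geom_bar(aes(x = class, y = after_stat(count)))
bar_new

# Both plots contain the same computed counts:
counts <- layer_data(bar_new, 1)$count
all.equal(layer_data(bar_old, 1)$count, counts)

# Proportions can be expressed analogously:
ggplot(mpg) +
  geom_bar(aes(x = class, y = after_stat(prop), group = 1))
```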
Proportion of cases
An alternative to showing the count or frequency of cases is showing the corresponding proportion of cases:
library(ggplot2)
## Data:
ggplot2::mpg
#> # A tibble: 234 x 11
#> manufacturer model displ year cyl trans drv cty hwy fl
#> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr>
#> 1 audi a4 1.80 1999 4 auto(l… f 18 29 p
#> 2 audi a4 1.80 1999 4 manual… f 21 29 p
#> 3 audi a4 2.00 2008 4 manual… f 20 31 p
#> 4 audi a4 2.00 2008 4 auto(a… f 21 30 p
#> 5 audi a4 2.80 1999 6 auto(l… f 16 26 p
#> 6 audi a4 2.80 1999 6 manual… f 18 26 p
#> 7 audi a4 3.10 2008 6 auto(a… f 18 27 p
#> 8 audi a4 quat… 1.80 1999 4 manual… 4 18 26 p
#> 9 audi a4 quat… 1.80 1999 4 auto(l… 4 16 25 p
#> 10 audi a4 quat… 2.00 2008 4 manual… 4 20 28 p
#> # ... with 224 more rows, and 1 more variable: class <chr>
# (1) Proportion of cases by class:
ggplot(mpg) +
geom_bar(aes(x = class, y = ..prop.., group = 1))
# is the same as:
ggplot(mpg) +
geom_bar(aes(x = class, y = ..count../sum(..count..)))
Practice: Plot the proportion of cases in the mpg data by cyl (in at least 3 different ways).
Bar plots of existing values
A common difficulty occurs when the table to plot already contains the values to be shown as bars. As there is nothing to be computed in this case, we need to specify stat = "identity" for geom_bar (to override its default of stat = "count").
For instance, let’s plot a bar chart that shows the election data from the following tibble de:
| year | party | share |
|---|---|---|
| 2013 | CDU/CSU | 0.415 |
| 2013 | SPD | 0.257 |
| 2013 | Others | 0.328 |
| 2017 | CDU/CSU | 0.330 |
| 2017 | SPD | 0.205 |
| 2017 | Others | 0.465 |
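The tibble de is not defined in this section. As a hedged sketch, it could be re-created from the table above by using tribble. The factor levels of party are an assumption (chosen to match the order of parties in the table). The sketch also uses geom_col, which ggplot2 provides as a shortcut for geom_bar(stat = "identity"); the names bp_col and share_sums are made up for illustration:

```r
library(tidyverse)

# Re-create the election data (values copied from the table above):
de <- tribble(
  ~year,  ~party,    ~share,
  "2013", "CDU/CSU", 0.415,
  "2013", "SPD",     0.257,
  "2013", "Others",  0.328,
  "2017", "CDU/CSU", 0.330,
  "2017", "SPD",     0.205,
  "2017", "Others",  0.465
)

# Assumption: party is a factor with levels in this order:
de$party <- factor(de$party, levels = c("CDU/CSU", "SPD", "Others"))

# geom_col() plots existing values directly (no stat = "identity" needed):
bp_col <- ggplot(de, aes(x = year, y = share, fill = party)) +
  geom_col(position = "dodge", color = "black")
bp_col

# Sanity check: the shares of each election should sum to 1:
share_sums <- de %>%
  group_by(year) %>%
  summarise(total = sum(share))
share_sums
```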
- A version with 2 x 3 separate bars (using position = "dodge"):
## Data: -----
de # => 6 x 3 tibble
#> # A tibble: 6 x 3
#> year party share
#> * <chr> <fct> <dbl>
#> 1 2013 CDU/CSU 0.415
#> 2 2013 SPD 0.257
#> 3 2013 Others 0.328
#> 4 2017 CDU/CSU 0.330
#> 5 2017 SPD 0.205
#> 6 2017 Others 0.465
## Note that year is of type character, which could be changed by:
# de$year <- parse_integer(de$year)
## (1) Bar chart with side-by-side bars (dodge): -----
## (a) minimal version:
bp_1 <- ggplot(de, aes(x = year, y = share, fill = party)) +
## (A) 3 bars per election (position = "dodge"):
geom_bar(stat = "identity", position = "dodge", color = "black") # 3 bars next to each other
bp_1
## (b) Version with text labels and customized colors:
bp_1 +
## pimping plot:
geom_text(aes(label = paste0(round(share * 100, 1), "%"), y = share + .01),
position = position_dodge(width = 1),
fontface = 2, color = "black") +
# Some set of high contrast colors:
scale_fill_manual(name = "Party:", values = c("black", "red3", "gold")) +
# Titles and labels:
labs(title = "Partial results of the German general elections 2013 and 2017",
x = "Year of election", y = "Share of votes",
caption = "Data from www.bundeswahlleiter.de.") +
# coord_flip() +
theme_bw()
- A version with 2 bars with 3 segments (using position = "stack"):
## Data: -----
de # => 6 x 3 tibble
#> # A tibble: 6 x 3
#> year party share
#> * <chr> <fct> <dbl>
#> 1 2013 CDU/CSU 0.415
#> 2 2013 SPD 0.257
#> 3 2013 Others 0.328
#> 4 2017 CDU/CSU 0.330
#> 5 2017 SPD 0.205
#> 6 2017 Others 0.465
## (2) Bar chart with stacked bars: -----
## (a) minimal version:
bp_2 <- ggplot(de, aes(x = year, y = share, fill = party)) +
## (B) 1 bar per election (position = "stack"):
geom_bar(stat = "identity", position = "stack") # 1 bar per election
bp_2
## (b) Version with text labels and customized colors:
bp_2 +
## Pimping plot:
geom_text(aes(label = paste0(round(share * 100, 1), "%")),
position = position_stack(vjust = .5),
color = rep(c("black", "white", "white"), 2),
fontface = 2) +
# Some set of high contrast colors:
scale_fill_manual(name = "Party:", values = c("black", "red3", "gold")) +
# Titles and labels:
labs(title = "Partial results of the German general elections 2013 and 2017",
x = "Year of election", y = "Share of votes",
caption = "Data from www.bundeswahlleiter.de.") +
# coord_flip() +
theme_classic()
Bar plots with error bars
It is typically a good idea to add some measure of variability (e.g., the standard deviation, standard error, or confidence interval) to bar plots. There is an entire range of geoms that draw error bars:
## Create data to plot: -----
n_cat <- 6
set.seed(101)
data <- tibble(
name = LETTERS[1:n_cat],
value = sample(seq(25, 50), n_cat),
sd = rnorm(n = n_cat, mean = 0, sd = 8)) # random values for illustration (can be negative, unlike real SDs)
data
#> # A tibble: 6 x 3
#> name value sd
#> <chr> <int> <dbl>
#> 1 A 34 1.71
#> 2 B 26 2.49
#> 3 C 42 9.39
#> 4 D 40 4.95
#> 5 E 30 -0.902
#> 6 F 31 7.34
## Error bars: -----
## x-aesthetic only:
# (a) errorbar:
ggplot(data) +
geom_bar(aes(x = name, y = value), stat = "identity", fill = "steelblue") +
geom_errorbar(aes(x = name, ymin = value - sd, ymax = value + sd),
width = 0.4, color = "orange", alpha = 1, size = 1.0)
# (b) linerange:
ggplot(data) +
geom_bar(aes(x = name, y = value), stat = "identity", fill = "olivedrab3") +
geom_linerange(aes(x = name, ymin = value - sd, ymax = value + sd),
color = "firebrick", alpha = 1, size = 2.5)
## Additional y-aesthetic:
# (c) crossbar:
ggplot(data) +
geom_bar(aes(x = name, y = value), stat = "identity", fill = "tomato4") +
geom_crossbar(aes(x = name, y = value, ymin = value - sd, ymax = value + sd),
width = 0.3, color = "sienna1", alpha = 1, size = 1.0)
# (d) pointrange:
ggplot(data) +
geom_bar(aes(x = name, y = value), stat = "identity", fill = "burlywood4") +
geom_pointrange(aes(x = name, y = value, ymin = value - sd, ymax = value + sd),
color = "gold", alpha = 1.0, size = 1.2)
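The sd values used above are random numbers. As a minimal sketch of how error bars are more typically obtained, the following code computes the mean and standard deviation of hwy for each class of the mpg data (using dplyr) and plots them. The names hwy_stats, mn, and sdev are made up for illustration:

```r
library(tidyverse)

# Summarise the data: mean and SD of hwy by class:
hwy_stats <- mpg %>%
  group_by(class) %>%
  summarise(mn = mean(hwy),
            sdev = sd(hwy))
hwy_stats

# Bar plot of means with error bars of +/- 1 SD:
ggplot(hwy_stats, aes(x = class)) +
  geom_bar(aes(y = mn), stat = "identity", fill = "steelblue") +
  geom_errorbar(aes(ymin = mn - sdev, ymax = mn + sdev),
                width = 0.4, color = "orange") +
  labs(title = "Mean highway mileage by class (+/- 1 SD)",
       x = "Class of car", y = "hwy (miles per gallon)") +
  theme_bw()
```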
Drawing lines and curves
There are many types of lines. Here, we introduce some basic types.
- Straight and curved lines: When using lines to illustrate boundaries, limits, or trends in plots, we can add them by specifying their key parameters (e.g., their intercept, slope, etc.).
# Draw some basic lines:
# (_) Draw empty plot canvas:
ggplot()
# (a) Draw basic lines (by linear equation):
ggplot() +
geom_abline(linetype = 2, color = "forestgreen") + # dashed diagonal
geom_abline(intercept = 1/3, slope = 1/3) # y = .333 + .333 x
# Note the absence of labels on axes!
# (b) Add horizontal line:
ggplot() +
geom_abline(linetype = 2, color = "forestgreen") +
geom_abline(intercept = 1/3, slope = 1/3) +
geom_hline(yintercept = .50, color = "firebrick") # horizontal line
# Note: Labels on y-axis are added automatically.
# (c) Add vertical line:
ggplot() +
geom_abline(linetype = 2, color = "forestgreen") +
geom_abline(intercept = 1/3, slope = 1/3) +
geom_hline(yintercept = .50, color = "firebrick") +
geom_vline(xintercept = .75, color = "steelblue") # vertical line
# Note: Labels on x-axis are added automatically.
# (d) Add line segments (with start and end points):
ggplot() +
geom_abline(linetype = 2, color = "forestgreen") +
geom_abline(intercept = 1/3, slope = 1/3) +
geom_hline(yintercept = .50, color = "firebrick") +
geom_vline(xintercept = .75, color = "steelblue") +
geom_segment(aes(x = 1/4, y = 1, xend = 1, yend = 1/4),
color = "gold", arrow = NULL) # line segment
# Note: To draw arrows, replace NULL by an arrow specification like
# arrow(angle = 30, length = unit(0.5, "cm"), ends = "first", type = "closed")
# (e) Add curve (with start and end points):
ggplot() +
geom_abline(linetype = 2, color = "forestgreen") +
geom_abline(intercept = 1/3, slope = 1/3) +
geom_hline(yintercept = .50, color = "firebrick") +
geom_vline(xintercept = .75, color = "steelblue") +
geom_segment(aes(x = 1/4, y = 1, xend = 1, yend = 1/4),
color = "gold", arrow = NULL) +
geom_curve(aes(x = 1/3, y = 2/3, xend = 1, yend = 1/3),
color = "orange", curvature = -.6) # curve
# (+) Prettify plot:
ggplot() +
geom_abline(linetype = 2, color = "forestgreen") +
geom_abline(intercept = 1/3, slope = 1/3) +
geom_hline(yintercept = .50, color = "firebrick") +
geom_vline(xintercept = .75, color = "steelblue") +
geom_segment(aes(x = 1/4, y = 1, xend = 1, yend = 1/4),
color = "gold", arrow = NULL) +
geom_curve(aes(x = 1/3, y = 2/3, xend = 1, yend = 1/3),
color = "orange", curvature = -.6) +
labs(title = "Plotting basic lines",
x = "x-value", y = "y-value",
caption = "[ds4psy]") +
theme_bw()
- Drawing functions: A more general approach to drawing lines is using functions that define the value of y as a computation on some value x:
## Drawing functions:
# (a) Define some functions:
fn0 <- function(x){x}
fn1 <- function(x){1/3 * x + 1/3}
fn2 <- function(x){x^2 - x}
fn3 <- function(x){-log(abs(x))}
fn4 <- function(x){2^x}
fn5 <- function(x){2 * sin(x)}
fn6 <- function(x){rnorm(n = length(x), mean = 0, sd = 1)} # random noise (1 value per x)
# (b) Empty plotting canvas:
ggplot(data.frame(x = c(-10, 10)), aes(x = x)) # empty canvas from -10 < x < +10
# (c) Draw functions with stat_function():
ggplot(data.frame(x = c(-10, 10)), aes(x = x)) +
stat_function(fun = fn0, color = "black") +
stat_function(fun = fn1, color = "steelblue") +
stat_function(fun = fn2, color = "forestgreen") +
stat_function(fun = fn3, color = "firebrick") +
stat_function(fun = fn4, color = "gold") +
stat_function(fun = fn5, color = "orange") +
stat_function(fun = fn6, color = "grey50") +
## Prettify plot: ##
labs(title = "Plotting functions", caption = "[ds4psy]") +
coord_cartesian(xlim = c(-3, +3), ylim = c(-3, +3)) + # zoom in on plot region
theme_bw() # use bw theme
- Line plots of data: When we have grouped data (e.g., some values measured repeatedly over time) it often makes sense to show their development as a line plot. For instance, imagine having taken the following measurements of 3 people over the days of 1 week:
| name | Mon | Tue | Wed | Thu | Fri | Sat | Sun |
|---|---|---|---|---|---|---|---|
| Adam | 2.5 | 3.6 | 3.8 | 4.2 | 4.4 | 2.8 | 3.2 |
| Beta | 3.3 | 2.9 | 3.0 | 2.1 | 2.3 | 2.5 | 3.9 |
| Civo | 4.2 | 4.8 | 4.0 | 3.1 | 3.9 | 3.7 | 2.1 |
We can easily define this data as a tibble (e.g., row-by-row, using the tribble command), but then encounter a problem: To use geom_line we need to define a mapping from some variable x to some variable y. However, this table contains no single variable for x or y, but rather 7 columns (one for each day of the week) that each hold one measurement per person. To obtain one variable that contains the days (x) and another that contains all measured values (y), we need to re-format the data from wide to long format (see Chapter 12: Tidy data, which introduces the tidyr package).
# (a) Data tibble (in wide format):
tb <- tribble(
~name, ~Mon, ~Tue, ~Wed, ~Thu, ~Fri, ~Sat, ~Sun,
#-----|-----|-----|-----|-----|-----|-----|-----|
"Adam", 2.5, 3.6, 3.8, 4.2, 4.4, 2.8, 3.2,
"Beta", 3.3, 2.9, 3.0, 2.1, 2.3, 2.5, 3.9,
"Civo", 4.2, 4.8, 4.0, 3.1, 3.9, 3.7, 2.1
)
tb # print data (in wide format):
#> # A tibble: 3 x 8
#> name Mon Tue Wed Thu Fri Sat Sun
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Adam 2.5 3.6 3.8 4.2 4.4 2.8 3.2
#> 2 Beta 3.3 2.9 3.0 2.1 2.3 2.5 3.9
#> 3 Civo 4.2 4.8 4.0 3.1 3.9 3.7 2.1
# (b) Re-format from wide to long format (using tidyr commands):
tb_long <- tb %>%
gather(Mon:Sun, key = "day", value = "val") %>%
arrange(name)
tb_long # print data (in long format):
#> # A tibble: 21 x 3
#> name day val
#> <chr> <chr> <dbl>
#> 1 Adam Mon 2.5
#> 2 Adam Tue 3.6
#> 3 Adam Wed 3.8
#> 4 Adam Thu 4.2
#> 5 Adam Fri 4.4
#> 6 Adam Sat 2.8
#> 7 Adam Sun 3.2
#> 8 Beta Mon 3.3
#> 9 Beta Tue 2.9
#> 10 Beta Wed 3.0
#> # ... with 11 more rows
# (c) Line plot of tb_long:
ggplot(tb_long, aes(x = day, y = val, group = name, color = name)) +
geom_line(size = 1.0)
# However, note that x-axis labels are ordered alphabetically!
# The reason for this is that -- in tb_long -- day is a character variable.
# To fix this, we need to turn day into a factor with levels in the desired order:
tb_long$day <- factor(tb_long$day, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
tb_long
#> # A tibble: 21 x 3
#> name day val
#> <chr> <fctr> <dbl>
#> 1 Adam Mon 2.5
#> 2 Adam Tue 3.6
#> 3 Adam Wed 3.8
#> 4 Adam Thu 4.2
#> 5 Adam Fri 4.4
#> 6 Adam Sat 2.8
#> 7 Adam Sun 3.2
#> 8 Beta Mon 3.3
#> 9 Beta Tue 2.9
#> 10 Beta Wed 3.0
#> # ... with 11 more rows
# (d) Repeat (c) with tb_long$day as factor:
ggplot(tb_long, aes(x = day, y = val, group = name, color = name)) +
geom_line(size = 1.0)
# (e) A prettier version of the same plot:
ggplot(tb_long, aes(x = day, y = val, group = name, color = name, shape = name)) +
geom_line(size = 1.0) +
geom_point(size = 2.5) +
labs(title = "Line plot of data",
x = "Day of week", y = "Measurement",
caption = "[ds4psy]") +
scale_color_brewer(palette = "Set1") +
theme_bw()
Data exploration
This section summarizes some essential parts of Chapter 7: Exploratory data analysis (EDA).
Defining EDA
In the introduction to data visualization, we mentioned that creating good graphs is both an art and a craft. This implies that a recipe for creating good graphs involves three ingredients:
- a solid understanding of the data involved,
- the right set of tools to deal with data, and
- lots of dedicated practice in using these tools to solve concrete tasks.
This recipe can be extended beyond graphs, as a mixture of the same ingredients is needed for all aspects of data analysis. For instance, when obtaining and exploring a new dataset, it is both an art and a craft to quickly obtain a good understanding of its contents. Exploratory data analysis (EDA) is the process of getting a grasp of new data. Efficient and effective EDA requires combining commands on tibbles (tibble), data visualization (ggplot2), and data transformation (dplyr).
Basic questions
Getting a grasp of some data requires understanding two inter-related aspects:
Semantics: What is the meaning and functional role of the observations?
- What are the units of analysis (cases or observations)?
- What variables exist for each case/observation (e.g., multiple measures for each case)?
- What are relationships between observations (e.g., repeated measurements) or variables (e.g., correlations)?
- What are independent vs. dependent variables (of an experiment)?
Formats: What data types are contained in the data and how are they arranged?
- How is the data formatted (in rows vs. columns)?
- What types of variables (columns) exist?
- Is the data tidy? (See the definition in Chapter 12: Tidy data.)
Answering all these questions is often difficult and requires many small steps that analyze and transform a dataset.
In the following, we will illustrate the most common steps.
Typical steps
Here are some basic questions to answer whenever we get (load or create) a new data file:
- What are the dimensions of the data?
- What types of variables (columns) are involved?
- What are the cases or observations (rows)?
- What are the ranges, distributions, and unexpected values (e.g., missing data and outliers) of variables (columns)?
- What are the relationships between variables?
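As a minimal sketch, these questions can be addressed with a few basic commands. The example assumes the mpg data of ggplot2 as the new dataset; the object names class_counts and n_missing are made up for illustration:

```r
library(tidyverse)

# 1. Dimensions (rows and columns):
dim(mpg)

# 2. Types of variables (columns):
glimpse(mpg)

# 3. Cases or observations (rows):
head(mpg)

# 4. Ranges, distributions, and unexpected values:
summary(mpg$hwy)                   # range and quartiles of a continuous variable
class_counts <- count(mpg, class)  # frequencies of a categorical variable
class_counts
n_missing <- colSums(is.na(mpg))   # number of missing values per column
n_missing

# 5. Relationships between variables:
cor(mpg$cty, mpg$hwy)              # correlation of 2 continuous variables
```

These commands only provide a first overview. Plotting the distributions of and relations between variables is usually the next step.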
Dealing with missing data and outliers
ToDo
Plotting distributions and relations
Creating good graphs is both an art and a craft, but graphs also provide a quick overview of an unknown set of data. The key to creating good graphs lies in answering 2 sets of questions:
1. Knowing the number and type of variables to be plotted. This includes answering data-related questions like
   - How many variables are there to plot?
   - Are these variables categorical or continuous?
   - Do some variables qualify (e.g., group) the values of others?
2. Knowing the intended type of plot. This includes answering functional questions like
   - What is the purpose of this plot?
   - What are possible plots for this purpose?
   - Which of these would be the most appropriate plot?
Even when the questions of 1. and 2. are answered, creating good graphs with ggplot requires a lot of practice and many hours of trial-and-error experimentation.
Histograms
A histogram shows counts of the values of 1 (typically continuous) variable. This is useful for evaluating the distribution of the variable:
library(ggplot2)
# Create data:
tb <- tibble(iq = rnorm(n = 1000, mean = 100, sd = 15))
# Basic histogram:
ggplot(tb) +
geom_histogram(aes(x = iq), binwidth = 5)
# Pimped histogram:
ggplot(tb) +
geom_histogram(aes(x = iq), binwidth = 5,
fill = "gold", color = "black") +
labs(title = "Histogram", x = "IQ values", y = "Frequency in sample (n)",
caption = "[Using random iq data.]") +
theme_classic()
Scatterplots
A scatterplot shows the relationship between 2 (typically continuous) variables:
# Data:
ir <- as_tibble(iris)
ir
#> # A tibble: 150 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.10 3.50 1.40 0.200 setosa
#> 2 4.90 3.00 1.40 0.200 setosa
#> 3 4.70 3.20 1.30 0.200 setosa
#> 4 4.60 3.10 1.50 0.200 setosa
#> 5 5.00 3.60 1.40 0.200 setosa
#> 6 5.40 3.90 1.70 0.400 setosa
#> 7 4.60 3.40 1.40 0.300 setosa
#> 8 5.00 3.40 1.50 0.200 setosa
#> 9 4.40 2.90 1.40 0.200 setosa
#> 10 4.90 3.10 1.50 0.100 setosa
#> # ... with 140 more rows
# Basic scatterplot:
ggplot(ir) +
geom_point(aes(x = Petal.Length, y = Petal.Width, color = Species, shape = Species))
# Using 3 different facets:
ggplot(ir) +
geom_point(aes(x = Petal.Length, y = Petal.Width, color = Species)) +
facet_wrap(~Species)
# Pimped scatterplot:
ggplot(ir) +
geom_point(aes(x = Petal.Length, y = Petal.Width, fill = Species), pch = 21, color = "black", size = 2, alpha = 1/2) +
facet_wrap(~Species) +
# coord_fixed() +
labs(title = "Scatterplot", x = "Length of petal", y = "Width of petal",
caption = "[Using iris data.]") +
theme_bw() +
theme(legend.position = "none")
Box plots
ToDo:
- show medians, quartiles, distribution, and outliers
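A minimal sketch of such a box plot, assuming the mpg data and geom_boxplot of ggplot2. By default, each box spans the range from the first to the third quartile, the line inside the box marks the median, and points beyond the whiskers are drawn as potential outliers. The object names bx and box_stats are made up for illustration:

```r
library(ggplot2)

# Box plot of highway mileage by class of car:
bx <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot(fill = "grey90", color = "black") +
  labs(title = "Box plot", x = "Class of car", y = "hwy (miles per gallon)",
       caption = "[Using mpg data.]") +
  theme_bw()
bx

# The computed statistics (quartiles, median, outliers) of each box:
box_stats <- layer_data(bx, 1)
box_stats[ , c("lower", "middle", "upper")]
```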
Improving plots
Most default plots can be improved by fine-tuning their visual appearance. Popular levers for “pimping” plots include:
- colors, shapes and sizes can be set within geoms (and are variable when inside aes(...), or fixed when set outside). Using color often involves setting specific color scales;
- labels are essential for understanding plots: labs(...) allows setting titles, captions, axis labels, etc.;
- legends can be crucial for understanding aesthetic mappings. They can be edited or (re-)moved;
- themes allow for a consistent look, can be selected and modified.
More on data visualization
- study vignette("ggplot") and the documentation for ggplot and various geoms (e.g., geom_);
- study https://ggplot2.tidyverse.org/reference/ and its examples;
- see the cheat sheet on data visualization;
- read Chapter 3: Data visualization and Chapter 7: Exploratory data analysis (EDA) and complete their exercises.
Tidy data
Chapter 12: Tidy data teaches a consistent way to organise tabular data. It introduces commands of the tidyr package, which is a core member of the tidyverse.
Tabular data
In R, rectangular data is often organized in tibbles or data frames. Importantly, each column is a vector (of a particular type) that contains the values of a variable. Thus, whereas every column must be of one type, every row can contain values of different variables and types.
The same set of data (values of variables) can be organised in many different ways. For instance, the following tables (or tibbles) all provide the number of TB cases documented by the World Health Organization in 3 countries (Afghanistan, Brazil, and China) in 2 years (1999 and 2000):
| country | year | cases | population |
|---|---|---|---|
| Afghanistan | 1999 | 745 | 19987071 |
| Afghanistan | 2000 | 2666 | 20595360 |
| Brazil | 1999 | 37737 | 172006362 |
| Brazil | 2000 | 80488 | 174504898 |
| China | 1999 | 212258 | 1272915272 |
| China | 2000 | 213766 | 1280428583 |
library(tidyverse)
## Example of the same data organised in 4 different ways:
# ?table1 # for semantics and source of data
tidyr::table1
#> # A tibble: 6 x 4
#> country year cases population
#> <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
tidyr::table2
#> # A tibble: 12 x 4
#> country year type count
#> <chr> <int> <chr> <int>
#> 1 Afghanistan 1999 cases 745
#> 2 Afghanistan 1999 population 19987071
#> 3 Afghanistan 2000 cases 2666
#> 4 Afghanistan 2000 population 20595360
#> 5 Brazil 1999 cases 37737
#> 6 Brazil 1999 population 172006362
#> 7 Brazil 2000 cases 80488
#> 8 Brazil 2000 population 174504898
#> 9 China 1999 cases 212258
#> 10 China 1999 population 1272915272
#> 11 China 2000 cases 213766
#> 12 China 2000 population 1280428583
tidyr::table3
#> # A tibble: 6 x 3
#> country year rate
#> * <chr> <int> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
#> 4 Brazil 2000 80488/174504898
#> 5 China 1999 212258/1272915272
#> 6 China 2000 213766/1280428583
tidyr::table4a
#> # A tibble: 3 x 3
#> country `1999` `2000`
#> * <chr> <int> <int>
#> 1 Afghanistan 745 2666
#> 2 Brazil 37737 80488
#> 3 China 212258 213766
tidyr::table4b
#> # A tibble: 3 x 3
#> country `1999` `2000`
#> * <chr> <int> <int>
#> 1 Afghanistan 19987071 20595360
#> 2 Brazil 172006362 174504898
#> 3 China 1272915272 1280428583
tidyr::table5
#> # A tibble: 6 x 4
#> country century year rate
#> * <chr> <chr> <chr> <chr>
#> 1 Afghanistan 19 99 745/19987071
#> 2 Afghanistan 20 00 2666/20595360
#> 3 Brazil 19 99 37737/172006362
#> 4 Brazil 20 00 80488/174504898
#> 5 China 19 99 212258/1272915272
#> 6 China 20 00 213766/1280428583
Practice: Recreate the above bar plot using ggplot2 with data = table1.
Defining tidy data
Definition: A tidy dataset conforms to 3 interrelated rules:
Each variable must have its own column.
Each case/observation must have its own row.
Each value must have its own cell.
See http://r4ds.had.co.nz/tidy-data.html#fig:tidy-structure for a graphical illustration of these rules.
The 3 rules defining tidy data are connected, as it is impossible to only satisfy 2 of the 3. This leads to a simpler set of practical instructions for tidying a messy set of data:
- turn each dataset into a tibble.
- put each variable into a column.
Note that we need to interpret the semantics of the variables to understand whether a data set is tidy.
Practice: Which of the data tables in the above example (table1 to table5) are tidy? Why or why not?
Advantages of tidy data
- Consistency: Consistent data structures make it easier to learn the tools that work with them because they have an underlying uniformity.
- Vectorization: Placing variables in columns allows R’s vectorised nature to shine. For instance, the basic verbs of dplyr (and most built-in R functions) work with vectors of values. That makes transforming tidy data easy and natural.
- Matching data and tools: Packages like dplyr, ggplot2, and many others are designed to work with tidy data.
Commands and examples
We consider 2 pairs of 2 complementary commands as essential:
- separate splits 1 variable into 2 variables;
- unite combines 2 variables into 1 variable;
- gather makes wide data longer (by gathering many variables into 1);
- spread makes long data wider (by spreading 1 variable into many).
separate is the complement/opposite of unite and spread is the complement/opposite of gather.
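Note that in tidyr version 1.0.0 and later, gather and spread are superseded by pivot_longer and pivot_wider (while the older commands continue to work). A minimal sketch of the newer pair, assuming such a version of tidyr and using table4a; the names t4_long and t4_wide are made up for illustration:

```r
library(tidyr)

# Wide to long (corresponds to gather):
t4_long <- pivot_longer(table4a,
                        cols = c(`1999`, `2000`),
                        names_to = "year",
                        values_to = "cases")
t4_long

# Long to wide (corresponds to spread):
t4_wide <- pivot_wider(t4_long,
                       names_from = year,
                       values_from = cases)
t4_wide
```

As the 2 commands are complements, t4_wide contains the same columns and values as the original table4a.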
Here are some basic examples for using these 4 commands:
1. separate a variable
separate splits 1 variable (column) into multiple variables (columns) – at a position where some separator character appears – and is the complement to unite. Using separate requires the following arguments:
- some tibble/data frame data;
- the variable (column) col to be separated (specified by its name or column number);
- the names of the new variables (columns) into which col is to be split (specified as a character vector);
- the separator character sep (as a character/regular expression).
An additional argument remove regulates whether the original column is dropped from the output tibble. By default, remove = TRUE.
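By default, separate returns the new columns as character variables (see <chr> in the output below). A minimal sketch of the optional convert argument, which asks tidyr to convert the new columns to more suitable types; the names t3_chr and t3_num are made up for illustration:

```r
library(tidyr)

# Default: new columns remain of type character:
t3_chr <- separate(table3, col = rate,
                   into = c("cases", "population"), sep = "/")
class(t3_chr$cases)

# With convert = TRUE, new columns are converted to numbers:
t3_num <- separate(table3, col = rate,
                   into = c("cases", "population"), sep = "/",
                   convert = TRUE)
class(t3_num$cases)

# Numeric columns allow computations (e.g., cases per 10,000 people):
t3_num$cases / t3_num$population * 10000
```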
# Data to use:
tidyr::table3 # Note that column rate contains 2 numbers, separated by "/".
#> # A tibble: 6 x 3
#> country year rate
#> * <chr> <int> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
#> 4 Brazil 2000 80488/174504898
#> 5 China 1999 212258/1272915272
#> 6 China 2000 213766/1280428583
## Basics: -----
# Full separate command:
separate(data = table3, col = rate, into = c("cases", "population"), sep = "/")
#> # A tibble: 6 x 4
#> country year cases population
#> * <chr> <int> <chr> <chr>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
# Note that "/" disappears from output tibble.
# Shorter versions of the same command:
separate(table3, rate, c("cases", "population"))
#> # A tibble: 6 x 4
#> country year cases population
#> * <chr> <int> <chr> <chr>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
# Using the pipe:
table3 %>%
separate(rate, c("cases", "population"))
#> # A tibble: 6 x 4
#> country year cases population
#> * <chr> <int> <chr> <chr>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
## Variants: -----
# Specifying the variable to be split (rate) by its column number (3):
table3 %>%
separate(3, c("cases", "population"))
#> # A tibble: 6 x 4
#> country year cases population
#> * <chr> <int> <chr> <chr>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
# Not dropping the original rate variable:
table3 %>%
separate(rate, c("cases", "population"), remove = FALSE)
#> # A tibble: 6 x 5
#> country year rate cases population
#> * <chr> <int> <chr> <chr> <chr>
#> 1 Afghanistan 1999 745/19987071 745 19987071
#> 2 Afghanistan 2000 2666/20595360 2666 20595360
#> 3 Brazil 1999 37737/172006362 37737 172006362
#> 4 Brazil 2000 80488/174504898 80488 174504898
#> 5 China 1999 212258/1272915272 212258 1272915272
#> 6 China 2000 213766/1280428583 213766 1280428583
The example shows that the argument names (data, col, and into) can be left out (as long as the arguments are provided in the correct order) and that sep can be left unspecified when tidyr can make a good guess about what the separator character might be.
However, consider the following table6, which is available online and can be read into R via read_csv("http://rpository.com/ds4psy/data/table6.csv"):
## Load data (as comma-separated file):
table6 <- read_csv("http://rpository.com/ds4psy/data/table6.csv") # from online source
## Alternatively (from local source "data/table6.csv"):
# table6 <- read_csv("data/table6.csv") # from local directory
table6
#> # A tibble: 6 x 2
#> country when_what
#> <chr> <chr>
#> 1 Afghanistan 19_99.745/19987071
#> 2 Afghanistan 20_00.2666/20595360
#> 3 Brazil 19_99.37737/172006362
#> 4 Brazil 20_00.80488/174504898
#> 5 China 19_99.212258/1272915272
#> 6 China 20_00.213766/1280428583
Here, the variable when_what contains several plausible separation characters: _, ., and /. Let’s first see what happens when we fail to provide a separating character sep, and then split the variable when_what in three different ways:
# Data to use:
table6 <- read_csv("http://rpository.com/ds4psy/data/table6.csv") # from online source
table6 # Note that column when_what contains several splitting options.
#> # A tibble: 6 x 2
#> country when_what
#> <chr> <chr>
#> 1 Afghanistan 19_99.745/19987071
#> 2 Afghanistan 20_00.2666/20595360
#> 3 Brazil 19_99.37737/172006362
#> 4 Brazil 20_00.80488/174504898
#> 5 China 19_99.212258/1272915272
#> 6 China 20_00.213766/1280428583
# What happens when we do not specify "sep"?
table6 %>%
separate(col = when_what, into = c("var_1", "var_2")) # sep is not provided!
#> # A tibble: 6 x 3
#> country var_1 var_2
#> * <chr> <chr> <chr>
#> 1 Afghanistan 19 99
#> 2 Afghanistan 20 00
#> 3 Brazil 19 99
#> 4 Brazil 20 00
#> 5 China 19 99
#> 6 China 20 00
# => when_what is split at every non-alphanumeric character (_, ., and /), but only the first 2 pieces are kept: Warning (and loss of data)!
# Specifying different splitting characters:
# (a) split at "_":
table6 %>%
separate(col = when_what, into = c("var_1", "var_2"), sep = "_") #
#> # A tibble: 6 x 3
#> country var_1 var_2
#> * <chr> <chr> <chr>
#> 1 Afghanistan 19 99.745/19987071
#> 2 Afghanistan 20 00.2666/20595360
#> 3 Brazil 19 99.37737/172006362
#> 4 Brazil 20 00.80488/174504898
#> 5 China 19 99.212258/1272915272
#> 6 China 20 00.213766/1280428583
# (b) split at "." (specified as a regular expression "\\."):
table6 %>%
separate(col = when_what, into = c("var_1", "var_2"), sep = "\\.")
#> # A tibble: 6 x 3
#> country var_1 var_2
#> * <chr> <chr> <chr>
#> 1 Afghanistan 19_99 745/19987071
#> 2 Afghanistan 20_00 2666/20595360
#> 3 Brazil 19_99 37737/172006362
#> 4 Brazil 20_00 80488/174504898
#> 5 China 19_99 212258/1272915272
#> 6 China 20_00 213766/1280428583
# (c) split at "/":
table6 %>%
separate(col = when_what, into = c("var_1", "var_2"), sep = "/")
#> # A tibble: 6 x 3
#> country var_1 var_2
#> * <chr> <chr> <chr>
#> 1 Afghanistan 19_99.745 19987071
#> 2 Afghanistan 20_00.2666 20595360
#> 3 Brazil 19_99.37737 172006362
#> 4 Brazil 20_00.80488 174504898
#> 5 China 19_99.212258 1272915272
#> 6 China 20_00.213766 1280428583
Note that using the point or period (.) as a splitting character sep = "." would not work, as sep is interpreted as a regular expression, in which . matches any character. Instead, we need to escape the period by using the regular expression sep = "\\.". (See Chapter 14: Strings for details.)
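The following sketch illustrates two equivalent ways of splitting at a literal period. It uses a hypothetical mini-version of table6 (t6_mini, containing only the first 2 rows shown above), so that it runs without an online connection:

```r
library(tibble)
library(tidyr)

# Hypothetical mini-version of table6 (first 2 rows only):
t6_mini <- tibble(
  country   = c("Afghanistan", "Afghanistan"),
  when_what = c("19_99.745/19987071", "20_00.2666/20595360")
)

# (a) Escaping the period with a double backslash:
split_1 <- separate(t6_mini, when_what, into = c("var_1", "var_2"), sep = "\\.")

# (b) Alternative: placing the period inside a character class:
split_2 <- separate(t6_mini, when_what, into = c("var_1", "var_2"), sep = "[.]")

split_1
identical(split_1$var_1, split_2$var_1)  # compare the results of (a) and (b)
```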
Practice: Split the when_what variable of table6 3 times to create a tibble table6a that contains 5 variables (columns) and reasonable variable names:
#> # A tibble: 6 x 5
#> country century year cases population
#> * <chr> <chr> <chr> <chr> <chr>
#> 1 Afghanistan 19 99 745 19987071
#> 2 Afghanistan 20 00 2666 20595360
#> 3 Brazil 19 99 37737 172006362
#> 4 Brazil 20 00 80488 174504898
#> 5 China 19 99 212258 1272915272
#> 6 China 20 00 213766 1280428583
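One possible solution is sketched below. To keep the example self-contained, table6 is re-created here from the values printed above (rather than read from the online file), and the helper column names rest_1 and rest_2 are arbitrary choices:

```r
library(tibble)
library(tidyr)  # also provides the pipe %>%

# Re-create table6 from the values shown above:
table6 <- tribble(
  ~country,      ~when_what,
  "Afghanistan", "19_99.745/19987071",
  "Afghanistan", "20_00.2666/20595360",
  "Brazil",      "19_99.37737/172006362",
  "Brazil",      "20_00.80488/174504898",
  "China",       "19_99.212258/1272915272",
  "China",       "20_00.213766/1280428583"
)

# Split 3 times, each time at a different separator:
table6a <- table6 %>%
  separate(when_what, into = c("century", "rest_1"), sep = "_") %>%
  separate(rest_1, into = c("year", "rest_2"), sep = "\\.") %>%
  separate(rest_2, into = c("cases", "population"), sep = "/")

table6a
```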
2. unite variables
unite combines 2 or more variables (columns) into 1 variable (column) – adding an optional separator character – and is the complement to separate. Using unite requires the following arguments:
- some tibble/data frame data;
- the name of the new compound variable (column) col (specified as a character);
- the names of the variables (columns) to be combined (specified by their names or column numbers);
- an optional separator character sep (specified as a character string; by default, sep = "_").
An additional argument remove regulates whether the original columns are dropped from the output tibble. By default, remove = TRUE.
# Data to use:
tidyr::table5 # Note that columns 2 and 3 contain 2 values (as characters!) that belong together.
#> # A tibble: 6 x 4
#> country century year rate
#> * <chr> <chr> <chr> <chr>
#> 1 Afghanistan 19 99 745/19987071
#> 2 Afghanistan 20 00 2666/20595360
#> 3 Brazil 19 99 37737/172006362
#> 4 Brazil 20 00 80488/174504898
#> 5 China 19 99 212258/1272915272
#> 6 China 20 00 213766/1280428583
## Basics: -----
# Full unite command:
unite(data = table5, col = "yr", century, year, sep = "")
#> # A tibble: 6 x 3
#> country yr rate
#> * <chr> <chr> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
#> 4 Brazil 2000 80488/174504898
#> 5 China 1999 212258/1272915272
#> 6 China 2000 213766/1280428583
# Note that century and year variables disappear from output tibble.
# Shorter versions of the same command:
unite(table5, "yr", century, year, sep = "")
#> # A tibble: 6 x 3
#> country yr rate
#> * <chr> <chr> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
#> 4 Brazil 2000 80488/174504898
#> 5 China 1999 212258/1272915272
#> 6 China 2000 213766/1280428583
# Using the pipe:
table5 %>%
unite("yr", century, year, sep = "")
#> # A tibble: 6 x 3
#> country yr rate
#> * <chr> <chr> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
#> 4 Brazil 2000 80488/174504898
#> 5 China 1999 212258/1272915272
#> 6 China 2000 213766/1280428583
## Variants: -----
# Providing a different separation character:
table5 %>%
unite("yr", century, year, sep = "<--|-->")
#> # A tibble: 6 x 3
#> country yr rate
#> * <chr> <chr> <chr>
#> 1 Afghanistan 19<--|-->99 745/19987071
#> 2 Afghanistan 20<--|-->00 2666/20595360
#> 3 Brazil 19<--|-->99 37737/172006362
#> 4 Brazil 20<--|-->00 80488/174504898
#> 5 China 19<--|-->99 212258/1272915272
#> 6 China 20<--|-->00 213766/1280428583
# Specifying the variables to be combined (century and year) by their column numbers (2 & 3):
table5 %>%
unite("yr", 2, 3, sep = "")
#> # A tibble: 6 x 3
#> country yr rate
#> * <chr> <chr> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
#> 4 Brazil 2000 80488/174504898
#> 5 China 1999 212258/1272915272
#> 6 China 2000 213766/1280428583
# Not dropping the original variables:
table5 %>%
unite("yr", century, year, sep = "", remove = FALSE)
#> # A tibble: 6 x 5
#> country yr century year rate
#> * <chr> <chr> <chr> <chr> <chr>
#> 1 Afghanistan 1999 19 99 745/19987071
#> 2 Afghanistan 2000 20 00 2666/20595360
#> 3 Brazil 1999 19 99 37737/172006362
#> 4 Brazil 2000 20 00 80488/174504898
#> 5 China 1999 19 99 212258/1272915272
#> 6 China 2000 20 00 213766/1280428583
Practice: Take the data from dplyr::storms and unite the variables year, month, day into 1 variable date.
#> # A tibble: 6 x 11
#> name date hour lat long status category wind pressure ts_diameter
#> <chr> <chr> <dbl> <dbl> <dbl> <chr> <ord> <int> <int> <dbl>
#> 1 Amy 1975… 0 27.5 -79.0 tropi… -1 25 1013 NA
#> 2 Amy 1975… 6.00 28.5 -79.0 tropi… -1 25 1013 NA
#> 3 Amy 1975… 12.0 29.5 -79.0 tropi… -1 25 1013 NA
#> 4 Amy 1975… 18.0 30.5 -79.0 tropi… -1 25 1013 NA
#> 5 Amy 1975… 0 31.5 -78.8 tropi… -1 25 1012 NA
#> 6 Amy 1975… 6.00 32.4 -78.7 tropi… -1 25 1012 NA
#> # ... with 1 more variable: hu_diameter <dbl>
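A possible solution is sketched below. The name storms_date and the separator sep = "-" are assumptions made for illustration, as the truncated output above does not show which separator was used:

```r
library(dplyr)  # provides the storms data
library(tidyr)

storms_date <- storms %>%
  unite("date", year, month, day, sep = "-")  # sep = "-" is an assumption

head(storms_date)
```

Note that the resulting date variable is of type character, not a proper date.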
Practice: Read the data from read_csv("http://rpository.com/ds4psy/data/table7.csv") into a tibble table7 and inspect its dimension and contents.
- Use multiple (4) separate commands to split table7 into a tibble table7a with multiple (5) columns.
- Use multiple (4) unite commands on table7a to re-create a tibble table7b that contains all data in 1 column.
Examples of table7 and possible solutions for table7a and table7b:
#> # A tibble: 6 x 1
#> where_when_what
#> <chr>
#> 1 "Afghanistan@19:99$745\\19987071"
#> 2 "Afghanistan@20:00$2666\\20595360"
#> 3 "Brazil@19:99$37737\\172006362"
#> 4 "Brazil@20:00$80488\\174504898"
#> 5 "China@19:99$212258\\1272915272"
#> 6 "China@20:00$213766\\1280428583"
#> # A tibble: 6 x 5
#> country century year rate population
#> * <chr> <chr> <chr> <chr> <chr>
#> 1 Afghanistan 19 99 745 19987071
#> 2 Afghanistan 20 00 2666 20595360
#> 3 Brazil 19 99 37737 172006362
#> 4 Brazil 20 00 80488 174504898
#> 5 China 19 99 212258 1272915272
#> 6 China 20 00 213766 1280428583
#> # A tibble: 6 x 1
#> where_when_what
#> * <chr>
#> 1 Afghanistan:1999_745/19987071
#> 2 Afghanistan:2000_2666/20595360
#> 3 Brazil:1999_37737/172006362
#> 4 Brazil:2000_80488/174504898
#> 5 China:1999_212258/1272915272
#> 6 China:2000_213766/1280428583
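The following sketch shows one way of obtaining the tables above. For the sake of a self-contained example, table7 is re-created from the printed values (rather than read from the online file), and the names of all intermediate columns (rest_1, rest_2, rest_3, yr, what, when_what) are arbitrary choices:

```r
library(tibble)
library(tidyr)  # also provides the pipe %>%

# Re-create table7 (each "\\" in R code denotes a single backslash in the data):
table7 <- tibble(
  where_when_what = c("Afghanistan@19:99$745\\19987071",
                      "Afghanistan@20:00$2666\\20595360",
                      "Brazil@19:99$37737\\172006362",
                      "Brazil@20:00$80488\\174504898",
                      "China@19:99$212258\\1272915272",
                      "China@20:00$213766\\1280428583")
)
dim(table7)

# 4 separate commands (escaping $ and \, which are special in regular expressions):
table7a <- table7 %>%
  separate(where_when_what, into = c("country", "rest_1"), sep = "@") %>%
  separate(rest_1, into = c("century", "rest_2"), sep = ":") %>%
  separate(rest_2, into = c("year", "rest_3"), sep = "\\$") %>%
  separate(rest_3, into = c("rate", "population"), sep = "\\\\")

# 4 unite commands:
table7b <- table7a %>%
  unite("yr", century, year, sep = "") %>%
  unite("what", rate, population, sep = "/") %>%
  unite("when_what", yr, what, sep = "_") %>%
  unite("where_when_what", country, when_what, sep = ":")

table7a
table7b
```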
3. gather makes wide data longer
Gathering is the opposite of spreading and is used when observations that are distributed over multiple columns should be contained in 1 variable (column). More specifically, gather moves the values of several variables (columns) into 1 column value and describes this value by the value of a new key variable. When gathering more than 2 variables, this reduces the number of columns by increasing the number of rows (i.e., makes a wide data set longer).[2]
Using gather requires the following arguments:
- data is a data frame or tibble;
- key is the name of the variable that describes the values of the gathered columns (or name of the independent variable);
- value is the name of the variable that is contained in the gathered columns (or the name of the dependent variable);
- ... or var_x:var_y is a list of variables (columns) to be gathered.
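In more recent versions of tidyr (1.0.0 and later), gather is superseded by pivot_longer, in which names_to and values_to play the roles of key and value. The following sketch, which assumes such a version of tidyr and uses illustrative object names, shows that both commands yield the same data, albeit in a different order of rows:

```r
library(tidyr)
library(dplyr)  # for arrange()

# Gathering 2 columns with gather:
long_gather <- table4a %>%
  gather(key = "year", value = "cases", `1999`:`2000`)

# The same with pivot_longer (requires tidyr >= 1.0.0):
long_pivot <- table4a %>%
  pivot_longer(cols = `1999`:`2000`, names_to = "year", values_to = "cases")

# Same contents once the rows are arranged in the same order:
long_gather %>% arrange(country, year)
long_pivot %>% arrange(country, year)
```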
# ?gather # provides documentation
## Data to use:
table4a
#> # A tibble: 3 x 3
#> country `1999` `2000`
#> * <chr> <int> <int>
#> 1 Afghanistan 745 2666
#> 2 Brazil 37737 80488
#> 3 China 212258 213766
# Note that the counts of cases are distributed over 2 variables (columns) for each country.
## Basics: -----
# gather 2 variables into 1 variable:
gather(data = table4a,
key = year, value = cases,
`1999`:`2000`)
#> # A tibble: 6 x 3
#> country year cases
#> <chr> <chr> <int>
#> 1 Afghanistan 1999 745
#> 2 Brazil 1999 37737
#> 3 China 1999 212258
#> 4 Afghanistan 2000 2666
#> 5 Brazil 2000 80488
#> 6 China 2000 213766
# The same command using the pipe:
table4a %>%
gather(key = year, value = cases,
`1999`:`2000`)
#> # A tibble: 6 x 3
#> country year cases
#> <chr> <chr> <int>
#> 1 Afghanistan 1999 745
#> 2 Brazil 1999 37737
#> 3 China 1999 212258
#> 4 Afghanistan 2000 2666
#> 5 Brazil 2000 80488
#> 6 China 2000 213766
## Variants: -----
# The same command with a different order of arguments:
table4a %>%
gather(`1999`:`2000`, key = year, value = cases)
#> # A tibble: 6 x 3
#> country year cases
#> <chr> <chr> <int>
#> 1 Afghanistan 1999 745
#> 2 Brazil 1999 37737
#> 3 China 1999 212258
#> 4 Afghanistan 2000 2666
#> 5 Brazil 2000 80488
#> 6 China 2000 213766
# The same command specifying the numbers of the columns to gather:
table4a %>%
gather(2:3, key = year, value = cases)
#> # A tibble: 6 x 3
#> country year cases
#> <chr> <chr> <int>
#> 1 Afghanistan 1999 745
#> 2 Brazil 1999 37737
#> 3 China 1999 212258
#> 4 Afghanistan 2000 2666
#> 5 Brazil 2000 80488
#> 6 China 2000 213766
Note that year is of type character in the above example. If we want our key variable to be converted into a number (here: an integer), we can add the optional argument convert = TRUE:
## Default: convert = FALSE:
table4a %>%
gather(key = year, value = cases, `1999`:`2000`, convert = FALSE)
#> # A tibble: 6 x 3
#> country year cases
#> <chr> <chr> <int>
#> 1 Afghanistan 1999 745
#> 2 Brazil 1999 37737
#> 3 China 1999 212258
#> 4 Afghanistan 2000 2666
#> 5 Brazil 2000 80488
#> 6 China 2000 213766
# => year is a character vector.
## Converting year into an integer:
table4a %>%
gather(key = year, value = cases, `1999`:`2000`, convert = TRUE)
#> # A tibble: 6 x 3
#> country year cases
#> <chr> <int> <int>
#> 1 Afghanistan 1999 745
#> 2 Brazil 1999 37737
#> 3 China 1999 212258
#> 4 Afghanistan 2000 2666
#> 5 Brazil 2000 80488
#> 6 China 2000 213766
# => year is a vector of integers.
Practice: Save the following data as a tibble de and then turn it into tidy data (by using gather to create a single variable share and listing the election year as an additional variable).
| party | share_2013 | share_2017 |
|---|---|---|
| CDU/CSU | 0.415 | 0.330 |
| SPD | 0.257 | 0.205 |
| Others | 0.328 | 0.465 |
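One way of entering this table is sketched below. Defining party as a factor with explicit levels is an assumption that is based on the column type and the order of rows shown in the output of de:

```r
library(tibble)

de <- tibble(
  party = factor(c("CDU/CSU", "SPD", "Others"),
                 levels = c("CDU/CSU", "SPD", "Others")),  # keep the order of the table
  share_2013 = c(0.415, 0.257, 0.328),
  share_2017 = c(0.330, 0.205, 0.465)
)

# Sanity check: The shares of each election should add up to 1:
sum(de$share_2013)
sum(de$share_2017)
```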
## (a) Data saved as a tibble (see above):
de
#> # A tibble: 3 x 3
#> party share_2013 share_2017
#> <fct> <dbl> <dbl>
#> 1 CDU/CSU 0.415 0.330
#> 2 SPD 0.257 0.205
#> 3 Others 0.328 0.465
## (b) Converting de into a tidy data table:
de_2 <- de %>%
gather(share_2013:share_2017, key = "election", value = "share") %>%
separate(col = election, into = c("dummy", "year")) %>%
select(year, party, share)
de_2
#> # A tibble: 6 x 3
#> year party share
#> * <chr> <fct> <dbl>
#> 1 2013 CDU/CSU 0.415
#> 2 2013 SPD 0.257
#> 3 2013 Others 0.328
#> 4 2017 CDU/CSU 0.330
#> 5 2017 SPD 0.205
#> 6 2017 Others 0.465
4. spread makes long data wider
Spreading is the opposite of gathering and is used when an observation that should be in 1 row is distributed over multiple rows (in 1 column). More specifically, spread puts the values of several cases (rows) into different variables (columns) of 1 row. When spreading more than 2 rows per case, this decreases the number of rows by increasing the number of columns (i.e., makes a long data set wider).[3]
Using spread requires the following arguments:
- data is a data frame or tibble;
- key is the name of the variable whose values describe the values to be spread (i.e., the levels of the independent variable, which become the names of the new columns);
- value is the name of the variable whose values should be spread over multiple columns (or the name of the dependent variable).
Note that we do not need to specify a range of new columns. The number of new columns is determined by the number of different values in the key variable.
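This relation between the key variable and the number of new columns can be checked in a short sketch (using table2, which ships with tidyr). In more recent versions of tidyr (1.0.0 and later), spread is superseded by pivot_wider, which is shown here for comparison. The object names are chosen for illustration only:

```r
library(tidyr)
library(dplyr)  # for n_distinct()

wide_spread <- table2 %>%
  spread(key = type, value = count)

# The key and value columns are replaced by 1 new column per distinct key value:
n_distinct(table2$type)
ncol(wide_spread) == ncol(table2) - 2 + n_distinct(table2$type)

# The same with pivot_wider (requires tidyr >= 1.0.0):
wide_pivot <- table2 %>%
  pivot_wider(names_from = type, values_from = count)

names(wide_pivot)
```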
# ?spread # provides documentation
## Data to use:
table2
#> # A tibble: 12 x 4
#> country year type count
#> <chr> <int> <chr> <int>
#> 1 Afghanistan 1999 cases 745
#> 2 Afghanistan 1999 population 19987071
#> 3 Afghanistan 2000 cases 2666
#> 4 Afghanistan 2000 population 20595360
#> 5 Brazil 1999 cases 37737
#> 6 Brazil 1999 population 172006362
#> 7 Brazil 2000 cases 80488
#> 8 Brazil 2000 population 174504898
#> 9 China 1999 cases 212258
#> 10 China 1999 population 1272915272
#> 11 China 2000 cases 213766
#> 12 China 2000 population 1280428583
# Note that count contains 2 DVs which are described by the values of type.
## Basics: -----
# spread 2 rows into 2 columns of 1 row:
spread(data = table2,
key = type, value = count)
#> # A tibble: 6 x 4
#> country year cases population
#> * <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
# The same command using the pipe:
table2 %>%
spread(key = type, value = count)
#> # A tibble: 6 x 4
#> country year cases population
#> * <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
# The same shorter:
table2 %>%
spread(type, count)
#> # A tibble: 6 x 4
#> country year cases population
#> * <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
## Variants: -----
# Use sep to create new column names of the form <key_name><sep><key_value>:
table2 %>%
spread(key = type, value = count, sep = ":")
#> # A tibble: 6 x 4
#> country year `type:cases` `type:population`
#> * <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
Practice: Take the 6 x 3 tibble de_2 (from above) and use spread to create a 3 x 3 tibble de_3 that re-creates the original tibble de from it.
## (a) Data from above:
de_2
#> # A tibble: 6 x 3
#> year party share
#> * <chr> <fctr> <dbl>
#> 1 2013 CDU/CSU 0.415
#> 2 2013 SPD 0.257
#> 3 2013 Others 0.328
#> 4 2017 CDU/CSU 0.330
#> 5 2017 SPD 0.205
#> 6 2017 Others 0.465
## (b) Using spread to put share by year into 2 columns/variables:
de_3 <- de_2 %>%
spread(key = year, value = share) %>%
rename(share_2013 = `2013`, # restore original variable names
share_2017 = `2017`)
de_3
#> # A tibble: 3 x 3
#> party share_2013 share_2017
#> * <fctr> <dbl> <dbl>
#> 1 CDU/CSU 0.415 0.330
#> 2 SPD 0.257 0.205
#> 3 Others 0.328 0.465
## (c) Comparing de_3 to de:
de
#> # A tibble: 3 x 3
#> party share_2013 share_2017
#> <fctr> <dbl> <dbl>
#> 1 CDU/CSU 0.415 0.330
#> 2 SPD 0.257 0.205
#> 3 Others 0.328 0.465
all.equal(de_3, de)
#> [1] TRUE
Practice: Moving stocks from wide to long to wide.
The following table shows the start and end price of 3 stocks on 3 days (d1, d2, d3):
| stock | d1_start | d1_end | d2_start | d2_end | d3_start | d3_end |
|---|---|---|---|---|---|---|
| Amada | 2.5 | 3.6 | 3.5 | 4.2 | 4.4 | 2.8 |
| Betix | 3.3 | 2.9 | 3.0 | 2.1 | 2.3 | 2.5 |
| Cevis | 4.2 | 4.8 | 4.6 | 3.1 | 3.2 | 3.7 |
a. Create a tibble st that contains this data in this (wide) format.
b. Transform st into a longer table st_long that contains 18 rows and only 1 numeric variable for all stock prices. Adjust this table so that the day and time appear as 2 separate columns.
c. Create a (line) graph that shows the 3 stocks’ end prices (on the y-axis) over the 3 days (on the x-axis).
d. Spread st_long into a wider table that contains start and end prices as 2 distinct variables (columns) for each stock and day.
# library(tidyverse)
## (a) Enter stock data (in wide format) as a tibble:
st <- tribble(
~stock, ~d1_start, ~d1_end, ~d2_start, ~d2_end, ~d3_start, ~d3_end,
#-----|----------|--------|----------|--------|----------|--------|
"Amada", 2.5, 3.6, 3.5, 4.2, 4.4, 2.8,
"Betix", 3.3, 2.9, 3.0, 2.1, 2.3, 2.5,
"Cevis", 4.2, 4.8, 4.6, 3.1, 3.2, 3.7
)
dim(st)
#> [1] 3 7
## Note data structure:
## 2 nested factors: day (1 to 3), type (start or end).
## (b) Change from wide to long format
## that contains the day (d1, d2, d3) and type (start vs. end) as separate columns:
st_long <- st %>%
gather(d1_start:d3_end, key = "key", value = "val") %>%
separate(key, into = c("day", "time")) %>%
arrange(stock, day, time) # optional: arrange rows
st_long
#> # A tibble: 18 x 4
#> stock day time val
#> <chr> <chr> <chr> <dbl>
#> 1 Amada d1 end 3.60
#> 2 Amada d1 start 2.50
#> 3 Amada d2 end 4.20
#> 4 Amada d2 start 3.50
#> 5 Amada d3 end 2.80
#> 6 Amada d3 start 4.40
#> 7 Betix d1 end 2.90
#> 8 Betix d1 start 3.30
#> 9 Betix d2 end 2.10
#> 10 Betix d2 start 3.00
#> 11 Betix d3 end 2.50
#> 12 Betix d3 start 2.30
#> 13 Cevis d1 end 4.80
#> 14 Cevis d1 start 4.20
#> 15 Cevis d2 end 3.10
#> 16 Cevis d2 start 4.60
#> 17 Cevis d3 end 3.70
#> 18 Cevis d3 start 3.20
## (c) Plot the end values (on the y-axis) of the 3 stocks over 3 days (x-axis):
st_long %>%
filter(time == "end") %>%
ggplot(aes(x = day, y = val, color = stock, shape = stock)) +
geom_point(size = 4) +
geom_line(aes(group = stock)) +
## Pimping plot:
labs(title = "End prices of stocks",
x = "Day", y = "End price",
shape = "Stock:", color = "Stock:") +
theme_bw()
## (d) Change st_long into a wider format that lists start and end as 2 distinct variables (columns):
st_long %>%
spread(key = time, value = val) %>%
mutate(day_nr = parse_integer(str_sub(day, 2, 2))) # optional: get day_nr as integer variable
#> # A tibble: 9 x 5
#> stock day end start day_nr
#> <chr> <chr> <dbl> <dbl> <int>
#> 1 Amada d1 3.60 2.50 1
#> 2 Amada d2 4.20 3.50 2
#> 3 Amada d3 2.80 4.40 3
#> 4 Betix d1 2.90 3.30 1
#> 5 Betix d2 2.10 3.00 2
#> 6 Betix d3 2.50 2.30 3
#> 7 Cevis d1 4.80 4.20 1
#> 8 Cevis d2 3.10 4.60 2
#> 9 Cevis d3 3.70 3.20 3
More on tidy data
- Study the vignette on tidy data (vignette("tidy-data")) and the RStudio cheatsheet on Data Import for essential tidyr commands.
- Read Chapter 12: Tidy data and complete its exercises.
- For background information on the notion of tidy data, see Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1–23, available at http://www.jstatsoft.org/v59/i10/paper.
- Follow the links on https://tidyr.tidyverse.org for additional information.
Conclusion
All ds4psy essentials:
| Nr. | Topic |
|---|---|
| 1. | Creating and using tibbles |
| 2. | Data transformation |
| 3. | Visualizing data |
| 4. | Exploring data |
| 5. | Tidy data |
[Last update on 2018-07-10 12:20:07 by hn.]
[1] This is different in Sankey diagrams, as shown at https://developers.google.com/chart/interactive/docs/gallery/sankey.
[2] The length and width of a data set are relative terms here: gathering tends to decrease data width by increasing length, whereas spreading tends to decrease data length by increasing width.
[3] Again, the length and width of data sets are relative terms.